Use of Descriptive Metadata as a Knowledgebase for Analyzing Data in Large Textual Collections

Dharitri Misra; George R. Thoma

doi:10.2352/issn.2168-3204.2013.10.1.art00042

Descriptive metadata, such as an article's title, authors, institutional affiliations, keywords, and date of publication, collected either manually or automatically from document contents, is often used to search and retrieve relevant documents in an archived collection. This metadata, especially for a large text corpus such as a biomedical collection, may encapsulate patterns, trends, and other valuable information, which is usually revealed by using specialized data analysis software to answer specific questions. A more useful, generalized approach is to repurpose this metadata to serve as a knowledgebase for answering appropriate semantic queries.

At the US National Library of Medicine (NLM), we recently archived a large biomedical collection comprising annual conference proceedings that report findings from cholera research conducted between 1960 and 2011 under the “US-Japan Cooperative Medical Science Program” (CMSP). This program was established to address health problems in Southeast Asia and other developing countries. An R&D information management system developed at NLM, called “System for the Preservation of Electronic Resources” (SPER), automatically extracted descriptive metadata from this text corpus and built a DSpace-based archive for accessing the conference articles. SPER also used this metadata, together with special-purpose data analysis software, to derive detailed information about the CMSP research community, the timelines of important drugs and discoveries, and international collaboration.

In this paper, we describe the occurrence and extraction of metadata from the CMSP document set, and present an alternative approach in which this metadata is used to build a knowledgebase to support semantic queries about the CMSP. Specifically, we show the OWL-based hierarchical ontology model created to represent the CMSP with its publications, participants, and international collaboration over time. We discuss the technique used to convert the extracted metadata from relational database tables to OWL/RDF assertions suitable for supporting semantic queries. We show examples of queries performed against this CMSP knowledgebase, and discuss some scalability issues. Finally, we describe how this approach could be customized for other large textual collections, including one from the Food and Drug Administration previously archived by the SPER system.
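
As a rough illustration of the conversion step mentioned above, the following Python sketch turns rows of a relational table into RDF-style subject-predicate-object assertions and runs a simple pattern query over them. It is not the SPER implementation: the table layout, the `http://example.org/cmsp#` namespace, the property names (`hasTitle`, `publishedInYear`, `hasAuthor`), and the two data rows are all hypothetical placeholders, and the naive `match` function merely stands in for a real semantic query engine.

```python
import sqlite3

# Hypothetical namespace; the actual CMSP ontology IRIs are not given here.
NS = "http://example.org/cmsp#"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def iri(local_name):
    """Wrap a local name in the (assumed) namespace as an N-Triples IRI."""
    return f"<{NS}{local_name}>"


def literal(value):
    """Render a value as a plain N-Triples string literal."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def rows_to_triples(conn):
    """Map each row of the assumed `article` table to a set of assertions."""
    triples = []
    query = "SELECT id, title, year, author FROM article ORDER BY id"
    for art_id, title, year, author in conn.execute(query):
        subject = iri(f"article{art_id}")
        triples.append((subject, RDF_TYPE, iri("Article")))
        triples.append((subject, iri("hasTitle"), literal(title)))
        triples.append((subject, iri("publishedInYear"), literal(year)))
        triples.append((subject, iri("hasAuthor"), literal(author)))
    return triples


def to_ntriples(triples):
    """Serialize the assertions, one N-Triples statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def match(triples, s=None, p=None, o=None):
    """Naive triple-pattern match; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def build_demo():
    """Create an in-memory table with placeholder rows and convert it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE article "
        "(id INTEGER PRIMARY KEY, title TEXT, year INTEGER, author TEXT)"
    )
    conn.executemany(
        "INSERT INTO article VALUES (?, ?, ?, ?)",
        [
            (1, "Placeholder title A", 1965, "Author A"),
            (2, "Placeholder title B", 1990, "Author B"),
        ],
    )
    triples = rows_to_triples(conn)
    conn.close()
    return triples


if __name__ == "__main__":
    demo = build_demo()
    print(to_ntriples(demo))
    # Which articles were published in 1990?
    print(match(demo, p=iri("publishedInYear"), o=literal(1990)))
```

In practice one would more likely load such assertions into an OWL/RDF toolkit and query them with SPARQL; the standard-library version here only shows the shape of the row-to-assertion mapping.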