Volume: 6 | Article ID: art00023
A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models
DOI: 10.2352/issn.2168-3204.2009.6.1.art00023 | Published Online: January 2009
Abstract

One of the most expensive aspects of archiving digital documents is the manual acquisition of context-sensitive metadata useful for the subsequent discovery of, and access to, the archived items. For certain types of textual documents, such as journal articles, pamphlets, and official government records, where the metadata is contained within the body of the documents, a cost-effective method is to identify and extract the metadata in an automated way, applying machine learning and string pattern search techniques.

At the U.S. National Library of Medicine (NLM) we have developed an automated metadata extraction (AME) system that employs layout classification and recognition models with a metadata pattern search model for a text corpus with structured or semi-structured information. A combination of Support Vector Machine and Hidden Markov Model is used to create the layout recognition models from a training set of the corpus, following which a rule-based metadata search model is used to extract the embedded metadata by analyzing the string patterns within and surrounding each field in the recognized layouts.

In this paper, we describe the design of our AME system, with a focus on the metadata search model. We present the extraction results for a historic collection from the Food and Drug Administration, and outline how the system may be adapted for similar collections. Finally, we discuss some ongoing enhancements to our AME system.
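
To make the rule-based search step concrete, the following is a minimal sketch of how string patterns "within and surrounding each field" might be applied once a layout has been recognized. It is not the NLM implementation: the field names (`case_number`, `date`), the regular expressions, and the helper functions `extract_field` and `extract_metadata` are all hypothetical, and the layout recognition stage (SVM and HMM) is assumed to have already assigned OCR text to each zone.

```python
import re
from typing import Dict, List, Optional

# Hypothetical rules: each field maps to regexes tried in order.
# The literal label text (e.g. "Case No.") is the surrounding context;
# the capture group is the metadata value itself.
RULES: Dict[str, List["re.Pattern[str]"]] = {
    "case_number": [re.compile(r"\bCase\s+No\.?\s*(\d+)", re.IGNORECASE)],
    "date": [
        re.compile(r"\b(?:Dated|Issued)[:\s]+([A-Z][a-z]+ \d{1,2}, \d{4})")
    ],
}


def extract_field(zone_text: str, patterns: List["re.Pattern[str]"]) -> Optional[str]:
    """Return the first captured value found in the zone text, or None."""
    for pattern in patterns:
        match = pattern.search(zone_text)
        if match:
            return match.group(1).strip()
    return None


def extract_metadata(zones: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply the rules to the OCR text of each recognized layout zone.

    `zones` maps a field name to the text of the zone that the layout
    recognition stage (assumed, not shown here) assigned to that field.
    """
    return {
        field: extract_field(zones.get(field, ""), patterns)
        for field, patterns in RULES.items()
    }


if __name__ == "__main__":
    sample_zones = {
        "case_number": "CASE No. 1234. Adulteration of ...",
        "date": "Issued March 3, 1925.",
    }
    print(extract_metadata(sample_zones))
```

Under these assumptions, adapting the sketch to a similar collection would mainly mean replacing the entries in `RULES`, which mirrors the abstract's point that the search model is rule-based rather than trained.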

  Cite this article 

Dharitri Misra, Siyuan Chen, George R. Thoma, "A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models," in Proc. IS&T Archiving 2009, 2009, pp. 107-112, https://doi.org/10.2352/issn.2168-3204.2009.6.1.art00023

  Copyright statement 
Copyright © Society for Imaging Science and Technology 2009
Archiving Conference
2161-8798
Society for Imaging Science and Technology
7003 Kilworth Lane, Springfield, VA 22151, USA