Volume: 2 | Article ID: art00040
Metadata Extraction from Office Documents
DOI: 10.2352/issn.2168-3204.2005.2.1.art00040 | Published Online: January 2005
Abstract

This paper focuses on using layout-based techniques to automatically extract metadata when scanning office documents to an archive. Many office documents, such as letters, inter-office memos, and invoices, contain key information that is spatially arranged. Information arrayed in this manner is easy for a reader to identify and understand. However, the location of information varies greatly from one office document to the next, unlike forms, where the layout is static. This poses a challenge for layout-based metadata extraction techniques. Our system uses regular expression matching and stochastic grammars on lines of text to efficiently and accurately label text according to function, enabling archived documents to be precisely retrieved.
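
As a rough illustration of the regular-expression half of the approach described in the abstract, the sketch below labels each line of a memo by function. It is not the authors' system: the label names, the patterns, and the helper functions `label_line` and `label_lines` are all hypothetical, chosen only to show the idea, and the stochastic-grammar stage is not modeled here.

```python
import re

# Hypothetical patterns for illustration only; the abstract does not list
# the rules the authors actually used. The first matching pattern wins.
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
PATTERNS = [
    ("date", re.compile(r"\b(?:" + MONTHS + r")\s+\d{1,2},\s+\d{4}\b")),
    ("recipient", re.compile(r"^\s*To\s*:", re.IGNORECASE)),
    ("sender", re.compile(r"^\s*From\s*:", re.IGNORECASE)),
    ("subject", re.compile(r"^\s*(?:Subject|Re)\s*:", re.IGNORECASE)),
    ("salutation", re.compile(r"^\s*Dear\b", re.IGNORECASE)),
    ("closing", re.compile(r"^\s*(?:Sincerely|Regards|Best regards)\b", re.IGNORECASE)),
]


def label_line(line):
    """Return the label of the first pattern that matches, else 'body'."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return "body"


def label_lines(text):
    """Label every non-blank line of the text as a (label, line) pair."""
    return [(label_line(line), line) for line in text.splitlines() if line.strip()]
```

Calling `label_lines` on the recognized text of a scanned memo yields one `(label, line)` pair per non-blank line. A sequence model over those labels, which is the role the abstract assigns to stochastic grammars, could then resolve lines that match several patterns or none.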

  Cite this article 

William K. Stumbo, John C. Handley, "Metadata Extraction from Office Documents," in Proc. IS&T Archiving 2005, 2005, pp. 184-187, https://doi.org/10.2352/issn.2168-3204.2005.2.1.art00040

  Copyright statement 
Copyright © Society for Imaging Science and Technology 2005
Archiving Conference
2161-8798
Society for Imaging Science and Technology
7003 Kilworth Lane, Springfield, VA 22151, USA