Metadata Extraction from Office Documents

William K. Stumbo; John C. Handley

doi:10.2352/issn.2168-3204.2005.2.1.art00040

Abstract

This paper focuses on layout-based techniques for automatically extracting metadata when office documents are scanned into an archive. Many office documents, such as letters, inter-office memos, and invoices, contain key information that is spatially arranged. Information arrayed in this manner is easy for a reader to identify and understand. However, the location of information within office documents varies greatly from one document to the next, unlike forms, where the layout is static. This poses a challenge for layout-based metadata extraction techniques. Our system uses regular expression matching and stochastic grammars on lines of text to efficiently and accurately label text according to function, enabling archived documents to be precisely retrieved.
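
The regular-expression half of the approach described in the abstract can be illustrated with a minimal sketch. The label names, the patterns, and the functions `label_line` and `label_lines` are all hypothetical, since the abstract does not give the system's actual expressions, and the stochastic-grammar stage is omitted entirely.

```python
import re

# Hypothetical label patterns for memo-style lines. These are assumptions
# for illustration, not the expressions used by the system in the paper.
LINE_PATTERNS = [
    ("date", re.compile(
        r"^\s*(?:date\s*:\s*)?\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\s*$",
        re.IGNORECASE)),
    ("recipient", re.compile(r"^\s*to\s*:\s*\S", re.IGNORECASE)),
    ("sender", re.compile(r"^\s*from\s*:\s*\S", re.IGNORECASE)),
    ("subject", re.compile(r"^\s*(?:subject|re)\s*:\s*\S", re.IGNORECASE)),
]


def label_line(line):
    """Return the first label whose pattern matches the line, else 'other'."""
    for label, pattern in LINE_PATTERNS:
        if pattern.search(line):
            return label
    return "other"


def label_lines(lines):
    """Pair every line of recognized text with its functional label."""
    return [(label_line(line), line) for line in lines]


if __name__ == "__main__":
    memo = [
        "To: John Handley",
        "From: W. Stumbo",
        "Date: 01/01/2005",
        "Re: Invoice 42",
        "Please find attached.",
    ]
    for label, text in label_lines(memo):
        print(f"{label:10s} {text}")
```

Because each line is labeled independently of its position on the page, a sketch like this tolerates the layout variation the abstract mentions. A grammar over the resulting label sequence would presumably resolve lines that match several patterns or none.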

Archiving Conference

2161-8798

Society of Imaging Science and Technology

7003 Kilworth Lane, Springfield, VA 22151, USA

Published: 2005-01-01

Pages: 184-187