Volume: 5 | Article ID: art00040
Representation of Digitized Documents Using Document Specific Alphabets and Fonts
DOI: 10.2352/issn.2168-3204.2008.5.1.art00040 | Published Online: January 2008
Abstract

Today's digitization efforts produce huge collections of scanned documents. However, the means for automatic preparation and further processing, especially of ancient documents, are still limited. This paper presents the progress and implementation details of a framework for handling machine-printed documents without traditional OCR methods. The approach derives all information needed for encoding directly from the original itself, by extracting document-specific alphabets and corresponding fonts. In particular, the paper reports how preprocessing, text segmentation, alphabet extraction, font generation, document encoding, and the repository work and interact. It also describes the creation of ground truth data for evaluation and possible application scenarios for the system.
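
To make the idea of a document-specific alphabet concrete, the following is a minimal sketch and not the paper's implementation. It assumes glyphs have already been segmented into small binary bitmaps, and it swaps in a simple greedy grouping by Hamming distance for whatever matching the actual framework uses. The names `Glyph`, `hamming`, `extract_alphabet`, and the threshold `max_dist` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Assumption: a glyph is a tiny binary bitmap, one string of '0'/'1' per row.
Glyph = Tuple[str, ...]


def hamming(a: Glyph, b: Glyph) -> int:
    """Count differing pixels between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def same_shape(a: Glyph, b: Glyph) -> bool:
    """True if both bitmaps have the same number of rows and columns."""
    return len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b))


def extract_alphabet(glyphs: List[Glyph], max_dist: int = 1):
    """Greedily group similar glyphs (illustrative stand-in, not the paper's method).

    Returns (alphabet, codes):
      alphabet: one prototype bitmap per glyph class (a toy "font")
      codes:    for each input glyph, the index of its class in the alphabet
    """
    alphabet: List[Glyph] = []
    codes: List[int] = []
    for g in glyphs:
        for idx, proto in enumerate(alphabet):
            if same_shape(proto, g) and hamming(proto, g) <= max_dist:
                codes.append(idx)  # reuse an existing class
                break
        else:
            alphabet.append(g)  # first glyph of a new class becomes its prototype
            codes.append(len(alphabet) - 1)
    return alphabet, codes


def decode(alphabet: List[Glyph], codes: List[int]) -> List[Glyph]:
    """Re-render the encoded sequence using the class prototypes."""
    return [alphabet[c] for c in codes]
```

In this toy setup the page is represented by the code sequence plus the extracted prototypes, so no predefined character set is required. A real system would also have to store glyph positions and cope with touching or broken characters, which this sketch ignores.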

  Cite this article 

Stefan Pletschacher, "Representation of Digitized Documents Using Document Specific Alphabets and Fonts," in Proc. IS&T Archiving 2008, 2008, pp. 198-202, https://doi.org/10.2352/issn.2168-3204.2008.5.1.art00040

  Copyright statement 
Copyright © Society for Imaging Science and Technology 2008
Archiving Conference
2161-8798
Society for Imaging Science and Technology
7003 Kilworth Lane, Springfield, VA 22151, USA