Volume: 28 | Article ID: art00011
Automatic Transcription of Historical Newsprint by Leveraging the Kaldi Speech Recognition Toolkit
DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-062 | Published Online: February 2016
Abstract

We present a method for transcribing historical newsprint images. To begin, for training and evaluation, we created a corpus of human-generated transcripts for almost 38,000 image snippets, containing nearly five million words. This may be one of the largest corpora of transcribed historical newsprint ever created. We then developed our automatic transcription process by leveraging the pattern recognition and statistical components of the state-of-the-art speech recognition toolkit Kaldi. Specifically, we modified its language model behavior and replaced Kaldi's speech-to-features transformation components with our own image-to-features process. Our replacement components include the use of word partials; image rotation; line segmentation that extends state-of-the-art methods; and customized feature generation. We conduct two evaluations of our technology: (a) an evaluation based on random selections of newspaper snippets; and (b) a diachronic evaluation of newspaper snippets by time frame. We compare our results in these evaluations to those of the commercial engines ABBYY FineReader Version 12 and OmniPage 18, as well as to the freely available system Tesseract. We demonstrate that our process typically yields accuracies that are comparable to or exceed those of these other engines.
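
The abstract does not specify the image-to-features process, so the following is only a minimal sketch of the general idea: a speech recognizer such as Kaldi consumes a sequence of per-frame feature vectors, and a text-line image can be turned into a comparable sequence by sliding a narrow window from left to right. The function name `line_to_frames`, the window parameters, and the two toy features (ink density and vertical ink centroid) are assumptions made for illustration and are not taken from the paper.

```python
from typing import List

# A text-line image as rows of pixel intensities (0 = background, 1 = ink).
Image = List[List[float]]


def line_to_frames(line: Image, width: int = 3, step: int = 1) -> List[List[float]]:
    """Slide a window left to right and emit one feature vector per position.

    Hypothetical features, chosen only for illustration:
      [0] ink density of the window (fraction of inked area)
      [1] vertical centre of mass of the ink, scaled to [0, 1]
          (0.5 is used when the window contains no ink)
    """
    if not line or not line[0]:
        return []
    height, n_cols = len(line), len(line[0])
    frames: List[List[float]] = []
    for start in range(0, n_cols - width + 1, step):
        # Ink per row inside the current window.
        ink = [sum(row[start:start + width]) for row in line]
        total = sum(ink)
        density = total / (height * width)
        if total:
            centroid = sum(i * v for i, v in enumerate(ink)) / total / max(height - 1, 1)
        else:
            centroid = 0.5
        frames.append([density, centroid])
    return frames


if __name__ == "__main__":
    toy_line = [
        [0, 0, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
    ]
    for frame in line_to_frames(toy_line, width=2, step=1):
        print(frame)
```

Each output row plays the role that an acoustic frame plays in speech recognition. A real system would use far richer features and would first apply the rotation and line segmentation steps that the abstract mentions.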

  Cite this article 

Patrick Schone, Alan Cannaday, Seth Stewart, Rachael Day, Jeremy Schone, "Automatic Transcription of Historical Newsprint by Leveraging the Kaldi Speech Recognition Toolkit," in Proc. IS&T Int’l. Symp. on Electronic Imaging: Document Recognition and Retrieval XXIII, 2016, https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-062

  Copyright statement 
Copyright © Society for Imaging Science and Technology 2016
Electronic Imaging
2470-1173
Society for Imaging Science and Technology