We present a method for transcribing historical newsprint images. For training and evaluation, we first created a corpus of human-generated transcripts for almost 38,000 image snippets, containing nearly five million words. This may be one of the largest corpora of transcribed historical newsprint ever created. We then developed our automatic transcription process by leveraging the pattern recognition and statistical components of the state-of-the-art speech recognition toolkit Kaldi. Specifically, we modified its language model behavior and replaced its speech-to-features transformation components with our own image-to-features process. Our replacement components include the use of word partials; image rotation; line segmentation that extends state-of-the-art methods; and customized feature generation. We conduct two evaluations of our technology: (a) an evaluation based on random selections of newspaper snippets; and (b) a diachronic evaluation of newspaper snippets by time frame. We compare the results of these evaluations to those of the commercial engines ABBYY FineReader Version 12 and OmniPage 18, as well as to the freely available system Tesseract. We demonstrate that our process typically yields accuracies that are comparable to or exceed those of these other engines.
Patrick Schone, Alan Cannaday, Seth Stewart, Rachael Day, Jeremy Schone, "Automatic Transcription of Historical Newsprint by Leveraging the Kaldi Speech Recognition Toolkit" in Proc. IS&T Int’l. Symp. on Electronic Imaging: Document Recognition and Retrieval XXIII, 2016, https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-062