Volume: 28 | Article ID: art00014
Revisiting Known-Item Retrieval in Degraded Document Collections
DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-065  Published Online: February 2016
Abstract

Optical character recognition (OCR) software converts an image of text into a text document but typically degrades the document's contents in the process. This work focuses on correcting such degradation so that the document set can be queried effectively. The described approach fuses substring generation rules with context-aware analysis to correct these errors. We evaluate on two publicly available datasets from the TREC-5 Confusion Track, with estimated error rates of 5% and 20%. On the 5% dataset, we demonstrate a statistically significant improvement in mean reciprocal rank (MRR) over both the prior art and Solr. On the 20% dataset, we demonstrate a statistically significant improvement over Solr and performance similar to the prior art. The described approach achieves MRRs of 0.6627 and 0.4924 on the collections with error rates of approximately 5% and 20%, respectively.
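
For readers unfamiliar with the metric, the following is a minimal sketch of how mean reciprocal rank is conventionally computed for known-item queries, where each query has exactly one target document. It is not the authors' evaluation code; the function names, document identifiers, and sample rankings are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], target_id: str) -> float:
    """Return 1/rank of the target document, or 0.0 if it was not retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == target_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], str]]) -> float:
    """Average the reciprocal rank over all (ranking, target) query pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranking, target) for ranking, target in runs) / len(runs)


# Hypothetical rankings for three known-item queries (illustrative data only).
sample_runs = [
    (["d3", "d1", "d2"], "d1"),  # target at rank 2, contributes 1/2
    (["d7", "d8"], "d7"),        # target at rank 1, contributes 1
    (["d4", "d5"], "d9"),        # target not retrieved, contributes 0
]
mrr = mean_reciprocal_rank(sample_runs)
```

Under this definition, an MRR near 0.66 means the target document appears, on average, between the first and second result positions.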

  Cite this article 

Jason Soo, Ophir Frieder, "Revisiting Known-Item Retrieval in Degraded Document Collections," in Proc. IS&T Int’l. Symp. on Electronic Imaging: Document Recognition and Retrieval XXIII, 2016, https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-065

  Copyright statement 
Copyright © Society for Imaging Science and Technology 2016
Electronic Imaging
2470-1173
Society for Imaging Science and Technology