Optical character recognition software converts an image of text to a text document but typically degrades the document's contents. Correcting such degradation to enable the document set to be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context aware analysis to correct these errors. Evaluation was facilitated by two publicly available datasets from TREC-5's Confusion Track containing estimated error rates of 5% and 20% . On the 5% dataset, we demonstrate a statistically significant improvement over the prior art and Solr's mean reciprocal rank (MRR). On the 20% dataset, we demonstrate a statistically significant improvement over Solr, and have similar performance to the prior art. The described approach achieves an MRR of 0.6627 and 0.4924 on collections with error rates of approximately 5% and 20% respectively.
Jason Soo, Ophir Frieder, "Revisiting Known-Item Retrieval in Degraded Document Collections" in Proc. IS&T Int’l. Symp. on Electronic Imaging: Document Recognition and Retrieval XXIII, 2016, https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-065