72010604

Electronic Imaging

2470-1173

Society for Imaging Science and Technology

10.2352/ISSN.2470-1173.2016.17.DRR-065

2470-1173(20160217)2016:17L.1;1-

s14.phd

/ist/ei/2016/00002016/00000017/art00014

Revisiting Known-Item Retrieval in Degraded Document Collections

Soo

Jason

Frieder

Ophir

17 02 2016

2016 17 1 9

2016

Optical character recognition (OCR) software converts an image of text into a text document but typically degrades the document's contents in the process. This work focuses on correcting such degradation so that the document set can be queried effectively. The described approach fuses substring generation rules with context-aware analysis to correct these errors. Evaluation used two publicly available datasets from TREC-5's Confusion Track, with estimated error rates of 5% and 20%. On the 5% dataset, we demonstrate a statistically significant improvement in mean reciprocal rank (MRR) over both the prior art and Solr. On the 20% dataset, we demonstrate a statistically significant improvement over Solr and performance similar to the prior art. The described approach achieves an MRR of 0.6627 and 0.4924 on the collections with error rates of approximately 5% and 20%, respectively.
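
For readers unfamiliar with the metric, the following is a minimal sketch of how mean reciprocal rank is commonly computed in a known-item setting, where each query has exactly one target document. The function name, the convention of scoring a missing target as 0, and the example ranks are illustrative assumptions and are not taken from the paper's evaluation.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries.

    Each entry is the 1-based rank at which the known item was retrieved,
    or None if it was not retrieved (assumed here to contribute 0).
    """
    if not ranks:
        raise ValueError("at least one query is required")
    return sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)


# Hypothetical ranks for four queries (not data from the paper).
example_ranks = [1, 2, None, 4]
score = mean_reciprocal_rank(example_ranks)  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```

Under this definition, an MRR of 1.0 means the known item is always ranked first, so higher values indicate that the target document tends to appear nearer the top of the result list.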