<!DOCTYPE article PUBLIC '-//NLM//DTD Journal Publishing DTD v2.1 20050630//EN' 'http://uploads.ingentaconnect.com/docs/dtd/ingenta-journalpublishing.dtd'>
<article article-type="research-article">
  <front>
    <journal-meta>
      <journal-id journal-id-type="aggregator">72010604</journal-id>
      <journal-title>Electronic Imaging</journal-title>
      <issn pub-type="ppub">2470-1173</issn><issn pub-type="epub"></issn>
      <publisher>
        <publisher-name>Society for Imaging Science and Technology</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.2352/ISSN.2470-1173.2016.17.DRR-065</article-id>
      <article-id pub-id-type="sici">2470-1173(20160217)2016:17L.1;1-</article-id>
      <article-id pub-id-type="publisher-id">s14.phd</article-id>
      <article-id pub-id-type="other">/ist/ei/2016/00002016/00000017/art00014</article-id>
      <article-categories>
        <subj-group>
          <subject/>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Revisiting Known-Item Retrieval in Degraded Document Collections</article-title>
      </title-group>
      <contrib-group>
        <contrib>
          <name>
            <surname>Soo</surname>
            <given-names>Jason</given-names>
          </name>
        </contrib>
        <contrib>
          <name>
            <surname>Frieder</surname>
            <given-names>Ophir</given-names>
          </name>
        </contrib>
      </contrib-group>
      <pub-date>
        <day>17</day>
        <month>02</month>
        <year>2016</year>
      </pub-date>
      <volume>2016</volume>
      <issue>17</issue>
      <fpage>1</fpage>
      <lpage>9</lpage>
      <permissions>
        <copyright-year>2016</copyright-year>
      </permissions>
      <abstract>
        <p>Optical character recognition software converts an image of text to a text document but typically degrades the document's contents. Correcting such degradation so that the document set can be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context-aware analysis to correct these errors. We evaluate it on two publicly available datasets from the TREC-5 Confusion Track, with estimated error rates of 5% and 20%. On the 5% dataset, we demonstrate a statistically significant improvement in mean reciprocal rank (MRR) over both the prior art and Solr. On the 20% dataset, we demonstrate a statistically significant improvement over Solr and performance similar to the prior art. The described approach achieves MRRs of 0.6627 and 0.4924 on collections with error rates of approximately 5% and 20%, respectively.</p>
      </abstract>
    </article-meta>
  </front>
</article>
