<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.1 20050630//EN" "http://uploads.ingentaconnect.com/docs/dtd/ingenta-journalpublishing.dtd">
<article article-type="research-article">
  <front>
    <journal-meta>
      <journal-id journal-id-type="aggregator">72010361</journal-id>
      <journal-title>Archiving Conference</journal-title>
      <abbrev-journal-title>archiving</abbrev-journal-title>
      <issn pub-type="ppub">2161-8798</issn><issn pub-type="epub"/>
      <publisher>
        <publisher-name>Society of Imaging Science and Technology</publisher-name>
        <publisher-loc>7003 Kilworth Lane, Springfield, VA 22151, USA</publisher-loc>
      </publisher>
    </journal-meta>
    <article-meta><article-id pub-id-type="doi">10.2352/issn.2168-3204.2013.10.1.art00050</article-id>
      <article-id pub-id-type="sici">2161-8798(20130101)2013:1L.234;1-</article-id>
      <article-id pub-id-type="publisher-id">ac_v2013n1/splitsection50.xml</article-id>
      <article-id pub-id-type="other">/ist/ac/2013/00002013/00000001/art00050</article-id>
      <article-categories>
        <subj-group>
          <subject>Articles</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>An open source infrastructure for quality assurance and preservation of a large digital book collection</article-title>
      </title-group>
      <contrib-group>
        <contrib>
          <name>
            <surname>Schlarb</surname>
            <given-names>Sven</given-names>
          </name>
        </contrib>
      </contrib-group>
      <pub-date>
        <day>01</day>
        <month>01</month>
        <year>2013</year>
      </pub-date>
      <volume>2013</volume>
      <issue>1</issue>
      <fpage>234</fpage>
      <lpage>238</lpage>
      <permissions>
        <copyright-year>2013</copyright-year>
      </permissions>
      <abstract>
        <p>This article presents an open source infrastructure for processing large collections of digital books held at the Austrian National Library, with a special focus on quality assurance tasks, in the context of the European project SCAPE (www.scape-project.eu). It describes the cluster
 hardware and the software components used for building the experimental IT infrastructure. More concretely, it presents a set of best practices for the data analysis of large document image collections on the basis of Apache Hadoop. Different types of Hadoop jobs (Hadoop-Streaming-API,
 Hadoop MapReduce, and Hive) are used as basic components, and the Taverna workflow description language and execution engine (www.taverna.org.uk) is used for orchestrating complex data processing tasks.</p>
      </abstract>
    </article-meta>
  </front>
</article>
