This article presents an open source infrastructure for processing large collections of digital books available at the Austrian National Library, with a special focus on quality assurance tasks in the context of the European project SCAPE (www.scape-project.eu). It describes the cluster hardware and the software components used to build the experimental IT infrastructure. More concretely, it presents a set of best practices for analysing large document image collections with Apache Hadoop. Different types of Hadoop jobs (Hadoop Streaming API, Hadoop MapReduce, and Hive) serve as basic components, and the Taverna workflow description language and execution engine (www.taverna.org.uk) is used to orchestrate complex data processing tasks.
Sven Schlarb, "An open source infrastructure for quality assurance and preservation of a large digital book collection," in Proc. IS&T Archiving 2013, 2013, pp. 234-238, https://doi.org/10.2352/issn.2168-3204.2013.10.1.art00050