72010361

Archiving Conference

archiving

2161-8798

Society of Imaging Science and Technology

7003 Kilworth Lane, Springfield, VA 22151, USA

10.2352/issn.2168-3204.2013.10.1.art00050

2161-8798(20130101)2013:1L.234;1-

ac_v2013n1/splitsection50.xml

/ist/ac/2013/00002013/00000001/art00050

Articles

An open source infrastructure for quality assurance and preservation of a large digital book collection

Schlarb

Sven

01 01 2013

2013 1 234 238

2013

This article presents an open source infrastructure for processing large collections of digital books held at the Austrian National Library, with a special focus on quality assurance tasks in the context of the European project SCAPE (www.scape-project.eu). It describes the cluster hardware and the software components used to build the experimental IT infrastructure. More concretely, it presents a set of best practices for analysing large document image collections with Apache Hadoop. Different types of Hadoop jobs (Hadoop Streaming API, Hadoop MapReduce, and Hive) serve as basic components, and the Taverna workflow description language and execution engine (www.taverna.org.uk) is used to orchestrate complex data processing tasks.