An open source infrastructure for quality assurance and preservation of a large digital book collection

Sven Schlarb

doi:10.2352/issn.2168-3204.2013.10.1.art00050

Abstract

This article presents an open source infrastructure for processing large collections of digital books held at the Austrian National Library, with a special focus on quality assurance tasks in the context of the European project SCAPE (www.scape-project.eu). It describes the cluster hardware and the software components used to build the experimental IT infrastructure. More concretely, it presents a set of best practices for analysing large document image collections with Apache Hadoop. Different types of Hadoop jobs (Hadoop Streaming API, Hadoop MapReduce, and Hive) serve as basic components, and the Taverna workflow description language and execution engine (www.taverna.org.uk) is used to orchestrate complex data processing tasks.

72010361

Archiving Conference

archiving

2161-8798

Society of Imaging Science and Technology

7003 Kilworth Lane, Springfield, VA 22151, USA

2161-8798(20130101)2013:1L.234;1-

ac_v2013n1/splitsection50.xml

/ist/ac/2013/00002013/00000001/art00050

Articles

Published 1 January 2013, pp. 234-238