Volume: 10 | Article ID: art00050
An open source infrastructure for quality assurance and preservation of a large digital book collection
DOI: 10.2352/issn.2168-3204.2013.10.1.art00050 | Published Online: January 2013
Abstract

This article presents an open source infrastructure for processing large collections of digital books available at the Austrian National Library, with a special focus on quality assurance tasks in the context of the European project SCAPE (www.scape-project.eu). It describes the cluster hardware and the software components used to build the experimental IT infrastructure. More concretely, it presents a set of best practices for analysing large document image collections with Apache Hadoop. Different types of Hadoop jobs (Hadoop Streaming API, Hadoop MapReduce, and Hive) serve as basic components, and the Taverna workflow description language and execution engine (www.taverna.org.uk) is used to orchestrate complex data processing tasks.

  Cite this article 

Sven Schlarb, "An open source infrastructure for quality assurance and preservation of a large digital book collection," in Proc. IS&T Archiving 2013, 2013, pp 234 - 238, https://doi.org/10.2352/issn.2168-3204.2013.10.1.art00050

  Copyright statement 
Copyright © Society for Imaging Science and Technology 2013
Archiving Conference
2161-8798
Society for Imaging Science and Technology
7003 Kilworth Lane, Springfield, VA 22151, USA