72010361

Archiving Conference

archiving

2161-8798

Society for Imaging Science and Technology

10.2352/issn.2168-3204.2016.1.0.24

2161-8798(20160419)2016:1L.24;1-

s7.phd

/ist/ac/2016/00002016/00000001/art00007

Articles

Scalable Processing and Search in Package-based Repositories

Schlarb

Sven

Schmidt

Rainer

Bartha

Mihai

Karl

Roman

19 04 2016

2016 1 24 27

2016

This paper describes the architecture of the prototype implementation developed in the E-ARK project. The prototype is specifically designed to support scalable and efficient data transformation, information extraction from archival information packages, and full-text search in the repository. Continuing previous work on using Hadoop to process large data volumes, the paper presents a combined approach that uses a distributed task queue for parallel processing together with Hadoop and HBase. This combination allows compute-intensive and long-running tasks to be applied during ingest and supports full-text indexing of very large document collections.
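
The fan-out pattern named in the abstract can be illustrated with a minimal sketch. It is not the E-ARK implementation: a local thread pool stands in for the distributed task queue, an in-memory dictionary stands in for the HBase-backed store and full-text index, and the names `PACKAGES`, `extract_terms`, and `build_index` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for archival information packages:
# package id -> text of the documents it contains.
PACKAGES = {
    "aip-001": ["scalable processing of archival packages"],
    "aip-002": ["full-text search in the repository", "processing during ingest"],
}


def extract_terms(package_id):
    """Per-package task: collect the lowercase terms of every document."""
    terms = set()
    for text in PACKAGES[package_id]:
        terms.update(text.lower().split())
    return package_id, terms


def build_index(package_ids, workers=4):
    """Fan tasks out to a worker pool and merge results into an inverted index."""
    index = defaultdict(set)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each package is processed independently, so tasks can run in parallel.
        for package_id, terms in pool.map(extract_terms, package_ids):
            for term in terms:
                index[term].add(package_id)
    return index


index = build_index(sorted(PACKAGES))
print(sorted(index["processing"]))
```

Because each package is handled as an independent task, the same structure carries over when the pool is replaced by workers on separate machines; only the merge step needs shared storage.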