This paper presents the architecture of the prototype implementation developed in the E-ARK project. The architecture is designed to support scalable and efficient data transformation, information extraction from archival information packages, and full-text search in the repository. Continuing previous work on using Hadoop to process large data volumes, the paper combines a distributed task queue for parallel processing with Hadoop and HBase, so that compute-intensive and long-running tasks can be applied during ingest and very large document collections can be indexed for full-text search.
Sven Schlarb, Rainer Schmidt, Mihai Bartha, Roman Karl, "Scalable Processing and Search in Package-based Repositories" in Proc. IS&T Archiving 2016, 2016, https://doi.org/10.2352/issn.2168-3204.2016.1.0.24