Implementation of a high performance architecture for managing and storing web-harvested collections

Michael Smorul; Joseph JaJa

doi:10.2352/issn.2168-3204.2011.8.1.art00003

As institutions continue to grow their collections of web-harvested content, there is an ever-increasing need for tools that organize, index and share this data. Even a modest web crawl consisting of a few web sites may generate millions of harvested documents. Repeating these crawls over time greatly expands the complexity of the stored data. Identifying the scope of a crawl, the location of a page within a crawl, and the differences over time between crawls becomes a challenging task. In this paper we describe a software architecture in use at the University of Maryland designed to support research on quickly extracting information about the crawls, including statistical information, and on indexing web content. While designed to support research, many of the challenges addressed in this software exist at any site that has to manage large sets of time-spanning data.

Our architecture consists of two components. The first is a database application for organizing WARC-based web data, called the WarcManager. The WarcManager was designed to track URL location and to allow easy extraction of crawl statistics across collections of WARC-stored data. It provides both a REST-based API to harvested data and a portal for viewing statistics across the collection. The second component is a high-performance, HTTP-based storage service called the Simple Web-Accessible Preservation (SWAP) system. The SWAP system is a novel, distributed file placement and retrieval service. It has been designed to be minimally intrusive and to allow complete data recovery even in the absence of any SWAP software.

These two components have been used successfully to support research into high-performance indexing of web-based content. We describe the implementation and performance characteristics of each component, as well as possible real-world uses for the system.
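
To make the bookkeeping problem concrete, the following is a minimal illustrative sketch of the kind of index that tracking URL location and crawl statistics implies. It is not the WarcManager implementation: the abstract does not describe its schema or API, so every name here (`RecordLocation`, `UrlLocationIndex`, the field names, the sample crawl identifiers) is a hypothetical stand-in, and an in-memory dictionary takes the place of the database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordLocation:
    """Where one harvested document lives. All field names are hypothetical."""
    crawl_id: str   # which crawl captured the document
    warc_file: str  # WARC file holding the record
    offset: int     # byte offset of the record inside that file


class UrlLocationIndex:
    """In-memory stand-in for a database that tracks URL locations."""

    def __init__(self):
        self._by_url = defaultdict(list)         # url -> [RecordLocation, ...]
        self._count_by_crawl = defaultdict(int)  # crawl_id -> record count

    def add(self, url, location):
        """Register one captured record for a URL."""
        self._by_url[url].append(location)
        self._count_by_crawl[location.crawl_id] += 1

    def locate(self, url, crawl_id=None):
        """Return every known location of a URL, optionally within one crawl."""
        hits = self._by_url.get(url, [])
        if crawl_id is None:
            return list(hits)
        return [loc for loc in hits if loc.crawl_id == crawl_id]

    def crawl_stats(self):
        """Return the number of records captured per crawl."""
        return dict(self._count_by_crawl)

    def urls_in(self, crawl_id):
        """Return the set of URLs captured by a given crawl."""
        return {
            url
            for url, locs in self._by_url.items()
            if any(loc.crawl_id == crawl_id for loc in locs)
        }

    def urls_added_between(self, old_crawl, new_crawl):
        """Return URLs present in new_crawl but absent from old_crawl."""
        return self.urls_in(new_crawl) - self.urls_in(old_crawl)


# Example with made-up data: two crawls of the same site.
index = UrlLocationIndex()
index.add("http://example.org/", RecordLocation("crawl-1", "a.warc", 0))
index.add("http://example.org/", RecordLocation("crawl-2", "b.warc", 0))
index.add("http://example.org/new", RecordLocation("crawl-2", "b.warc", 4096))
```

Under these assumptions, `index.locate(url)` answers where a page sits within the collection, `index.crawl_stats()` gives a per-crawl record count, and `index.urls_added_between("crawl-1", "crawl-2")` gives a simple difference between crawls. At the scale of millions of documents such lookups would be served from a database rather than from memory.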