The Library of Congress has been collecting web content since 2000, first through its MINERVA project and, since 2004, as part of a broader Internet capture program. In addition to providing access to some of the collected content, we have begun to develop tools and techniques to better understand and preserve what we are collecting. Compared with other digital collections, content from the Web has some unique characteristics, such as naming issues and the varying types of relationships between items; nevertheless, when considered at the level of individual items, existing digital preservation approaches are entirely applicable.

In this article, we describe initial results from examining selected content from this perspective, including the tools used in our analysis of the Library's web collections, the approaches taken, and directions for further analysis. We intend this information to be useful for guiding future web harvesting and preservation efforts both within and outside the Library. Our goals include:

• Identifying and measuring the content types in the collection;
• Assessing the variation in file types and the validity of "wild" Internet content; and
• Determining typical attributes of various file types, to generate predictors for future web harvests.

We describe web collections as a specific case of a collection of heterogeneous digital content, focusing on the content as received. We do not address issues relating to acquiring the content, such as retrieval problems and link detection during the web crawl, as these issues have been addressed in detail elsewhere and are ultimately orthogonal to preservation.
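The first goal above, identifying and measuring content types, amounts to tallying the MIME types reported for harvested resources. A minimal illustrative sketch (not from the paper; the sample records and field layout are hypothetical, standing in for entries from a real crawl log):

```python
# Illustrative sketch: tallying content types in a small web-archive
# inventory. The (url, MIME type) pairs below are hypothetical; a real
# harvest would supply these values from crawl records.
from collections import Counter

records = [
    ("http://example.gov/index.html", "text/html"),
    ("http://example.gov/report.pdf", "application/pdf"),
    ("http://example.gov/logo.gif", "image/gif"),
    ("http://example.gov/about.html", "text/html"),
]

# Count occurrences of each reported MIME type.
type_counts = Counter(mime for _, mime in records)

for mime, count in type_counts.most_common():
    print(f"{mime}: {count}")
```

In practice, server-reported MIME types are often wrong or missing, so a tally like this would be cross-checked against format identification performed on the bytes themselves.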
Andrew Boyko, "Characterizing Web Archive Content," in Proc. IS&T Archiving 2005, 2005, pp. 43-47, https://doi.org/10.2352/issn.2168-3204.2005.2.1.art00010