Volume: 2 | Article ID: art00010
Characterizing Web Archive Content
DOI: 10.2352/issn.2168-3204.2005.2.1.art00010
Published Online: January 2005

The Library of Congress has been collecting web content since 2000, first through its MINERVA project and, since 2004, as part of a broader Internet capture project. In addition to providing access to some collected content, we have begun to develop tools and techniques to better understand and preserve what we are collecting. When compared with other digital collections, content from the Web has some unique characteristics, such as naming issues and the varying types of relationships between items; nevertheless, when considered at the level of individual items, existing digital preservation approaches are entirely applicable.

In this article, we describe some initial results from examining some selected content from this perspective, including the tools used in our analysis of the Library's Web collections, the approaches taken, and directions for further analysis. We intend that this information will be useful for guiding future web harvest and preservation efforts both within and outside the Library. Our goals include:

• Identifying and measuring the content types in the collection;
• Assessing the variation in file types and validity of “wild” Internet content; and
• Determining typical attributes of various file types, to generate predictors for future web harvests.

We describe web collections as a specific case of a collection of heterogeneous digital content, focusing on the content as received. We will not address issues relating to acquiring the content, such as retrieval problems and link detection during the web crawl, as these issues have been addressed in detail elsewhere and are ultimately orthogonal to preservation issues.
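
As an illustration of the first two goals (measuring content types and checking how well "wild" content matches what it claims to be), the following is a minimal sketch, not the tooling used by the Library. It assumes each harvested item is available as a pair of declared media type and payload bytes; the names `MAGIC_SIGNATURES`, `sniff_type`, and `characterize` are hypothetical, and the signature table covers only a handful of well-known formats.

```python
from collections import Counter

# Hypothetical, deliberately tiny signature table (well-known magic numbers only).
MAGIC_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def sniff_type(data: bytes) -> str:
    """Guess a media type from the leading bytes of a payload."""
    for magic, mime in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return mime
    # Crude HTML check: look for a doctype or <html> tag near the start.
    head = data[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    return "unknown"


def characterize(items):
    """Tally detected types and count items whose declared type disagrees.

    `items` is an iterable of (declared_type, payload_bytes) pairs.
    Items that cannot be identified are tallied but not counted as mismatches.
    """
    detected = Counter()
    mismatches = 0
    for declared, payload in items:
        actual = sniff_type(payload)
        detected[actual] += 1
        if actual != "unknown" and actual != declared:
            mismatches += 1
    return detected, mismatches


# Made-up sample data for illustration; not drawn from any real collection.
sample = [
    ("image/png", b"\x89PNG\r\n\x1a\n" + b"\x00" * 8),
    ("image/gif", b"\xff\xd8\xff\xe0" + b"\x00" * 8),  # declared GIF, sniffs as JPEG
    ("text/html", b"  <!DOCTYPE html><html></html>"),
    ("application/octet-stream", b"\x00\x01\x02"),
]
counts, mismatches = characterize(sample)
```

A real characterization pass would rely on a maintained format-identification tool and a validator per format rather than a hand-written table; the sketch only shows the shape of the tally.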

  Cite this article 

Andrew Boyko, "Characterizing Web Archive Content," in Proc. IS&T Archiving 2005, 2005, pp. 43-47.

  Copyright statement 
Copyright © Society for Imaging Science and Technology 2005
Archiving Conference
Society for Imaging Science and Technology
7003 Kilworth Lane, Springfield, VA 22151, USA