Volume: 10 | Article ID: art00041
Improving Access to Web Archives through Innovative Analysis of PDF Content
DOI: 10.2352/issn.2168-3204.2013.10.1.art00041 | Published Online: January 2013
Abstract

In 2008, five United States institutions collaborated to archive the U.S. federal government Web presence: the Library of Congress, the Internet Archive, the California Digital Library, the Government Printing Office, and the University of North Texas (UNT). Their objective was to document the changes coincident with the shift in leadership of the U.S. executive branch. The five partners identified key resources from the U.S. .gov top-level domain and completed crawls from September 2008 until March 2009. The resulting End of Term (EOT) 2008 Web Archive, a 16 TB dataset, was distributed to partners interested in providing local services and access to the archive. The UNT Libraries investigated Portable Document Format (PDF) files, a class of content many information professionals associate with the traditional notion of “discrete documents”. Over four million unique PDF documents were extracted from the Archive, and a series of metadata and information extraction processes was run on each document. Additionally, derivative raster images of the first page of each document were created. The resulting metrics were ingested into a database for further analysis, which brought to light previously hidden characteristics of the federal government's Web-published content. The paper discusses the overall workflow and describes the tools used to extract document features. Findings suggest opportunities for the development of retrieval tools that will provide new ways of selecting content and building collections from large Web archives.

  Cite this article 

Mark Phillips, Kathleen Murray, "Improving Access to Web Archives through Innovative Analysis of PDF Content" in Proc. IS&T Archiving 2013, 2013, pp. 186-192, https://doi.org/10.2352/issn.2168-3204.2013.10.1.art00041

  Copyright statement 
Copyright © Society for Imaging Science and Technology 2013
Archiving Conference
2161-8798
Society for Imaging Science and Technology
7003 Kilworth Lane, Springfield, VA 22151, USA