
Acquired in 2019 by a consortium of philanthropic and cultural heritage organizations, the Johnson Publishing Company (JPC) Archive is co-owned by the Getty Research Institute (GRI) and the Smithsonian National Museum of African American History and Culture (NMAAHC). Spanning the period from 1942, when John H. and Eunice W. Johnson founded the company, to the 21st century, the JPC Archive contains over 4 million photographs of published and unpublished works documenting the Black experience, some of which were featured in JPC’s 14 magazines, most notably JET and Ebony. Beyond depicting historically significant events and behind-the-scenes moments, the Archive presents an unmatched record of many facets of the life, work, and contributions of Black individuals, communities, groups, organizations, and businesses. Working collaboratively across the United States (from Los Angeles to Chicago to Washington, DC), these two large cultural heritage institutions currently co-steward the collection, each focusing on its strengths to bring this remarkable collection to the public.

The usability and accessibility of digitised archival data can be improved using deep learning solutions. In this paper, the authors present their work on developing a named entity recognition (NER) model for digitised archival data, specifically state authority documents. The entities for the model were chosen based on a survey of different user groups. In addition to common entities, two new entities were created to identify businesses (FIBC) and archival documents (JON). The NER model was trained by fine-tuning an existing Finnish BERT model. The training data also included modern born-digital texts so that the model performs well with various types of input. The finished model performs fairly well with OCR-processed data, achieving an overall F1 score of 0.868, and particularly well with the new entities (F1 scores of 0.89 and 0.97 for JON and FIBC, respectively).
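To make the reported F1 scores concrete, the following minimal sketch shows one common way to compute entity-level precision, recall, and F1 from BIO-tagged sequences. It is an illustration only: the abstract does not state which evaluation scheme or tooling the authors used, so the exact-match span criterion, the BIO tagging scheme, and the function names `extract_spans` and `entity_f1` are all assumptions. Only the entity labels FIBC and JON come from the text.

```python
from typing import List, Optional, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence.

    Assumption: an I- tag that does not continue a span of the same
    type is ignored (treated like "O").
    """
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the current span
        if label is not None:
            spans.add((label, start, i))  # close the open span
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]  # open a new span
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact span-and-type matches."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = extract_spans(gold_tags)
        pred_spans = extract_spans(pred_tags)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: two short sequences using the FIBC and JON labels.
gold = [["B-FIBC", "I-FIBC", "O", "B-JON"], ["O", "B-JON", "I-JON"]]
pred = [["B-FIBC", "I-FIBC", "O", "O"], ["O", "B-JON", "O"]]
precision, recall, f1 = entity_f1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy example, only one of the two predicted spans matches one of the three gold spans exactly, giving a precision of 0.5, a recall of about 0.333, and an F1 of 0.4. Exact matching is strict: the truncated JON span in the second sequence counts as both a false positive and a false negative.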

The Smithsonian Institution Digitization Program Office’s Collection Digitization team designs and develops a “three-pronged” workflow approach to mass digitization of museum collections, comprising the Physical, Imaging, and Virtual Workflows. This approach ensures proper handling of objects, optimizes capture throughput, and streamlines the processing and delivery of images through automation. The Physical Workflow Design defines the production space and the safe movement of objects from storage to the digitization production space; the Imaging Workflow Design defines the technical specifications, file deliverables, and the results of our ‘Item Driven Image Fidelity’ (IDIF) testing; and finally, the Virtual Workflow Design defines the lifecycle of the digital file, from creation to online access, describing the various data processes required for success.

The Hoover Institution Library & Archives (HILA) has implemented Smartsheet, a cloud-based project management tool, to manage tasks and cross-team handoffs for its new mass digitization program. By combining task-specific tools such as Capture One and LIMB Processing with the administrative flexibility of Smartsheet, HILA has succeeded in leveraging commercial project management functionality for cultural heritage purposes, improving the program’s efficiency, flexibility, and reporting capabilities.