Historical archival records are difficult for OCR systems to encode correctly because of their visual complexity, e.g. printed text mixed with handwritten annotations, paper degradation, and faded ink. This paper addresses the automatic identification and separation of handwritten and printed text in historical archival documents. It describes the creation of a synthetic dataset with pixel-level annotations and presents a new FCN-based model trained on historical data. Initial test results on synthesised data indicate an 18% IoU improvement in the recognition of printed pixels and a 10% IoU improvement in the recognition of handwritten pixels, compared with a state-of-the-art model trained on modern documents. Furthermore, an extrinsic OCR-based evaluation on the printed layer extracted from real historical documents shows a 26% performance increase.
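The per-class IoU (intersection over union) figures quoted above can be illustrated with a minimal sketch. It assumes that predictions and ground truth are integer label maps with the hypothetical encoding 0 = background, 1 = printed, 2 = handwritten; the function name, the label encoding, and the toy data are illustrative assumptions and are not taken from the paper.

```python
def per_class_iou(pred, target, num_classes):
    """Return the IoU of each class for two equally sized 2D label maps.

    IoU for class c = |pred == c AND target == c| / |pred == c OR target == c|.
    A class that appears in neither map has no defined IoU and yields None.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel agrees: it counts towards both intersection and union.
                inter[p] += 1
                union[p] += 1
            else:
                # Pixel disagrees: it enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    return [inter[c] / union[c] if union[c] else None for c in range(num_classes)]


# Toy 3x3 example (assumed labels: 0 = background, 1 = printed, 2 = handwritten).
target = [[0, 1, 1],
          [0, 2, 2],
          [0, 0, 2]]
pred = [[0, 1, 0],
        [0, 2, 2],
        [0, 1, 2]]

scores = per_class_iou(pred, target, num_classes=3)
print(scores)
```

In this toy case the handwritten class is segmented perfectly, while the printed class shares only one pixel out of a union of three, which shows how the metric penalises both missed and spurious pixels.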
This article presents the experience gained during the development and implementation of a digitization system for cultural heritage collections using a cell phone. The project was developed in four stages: researching and creating the system, training the professionals assigned to operate the equipment, writing guidelines that summarize the knowledge obtained and, finally, monitoring the results and disseminating the digital surrogates via Wikimedia. The digital team of the Moreira Salles Institute developed this project in partnership with Institute Goethe and Wiki Movimento Brasil between October 2021 and February 2022. The project rests on the understanding that different digitization methods can meet the different needs and resources of cultural heritage institutions, and it aims to democratize knowledge and contribute to public access to collections of fundamental importance for Brazilian history.