The presence of handwritten text and annotations alongside typewritten and machine-printed text makes historical archival records visually complex, which makes it difficult for OCR systems to transcribe their content accurately. This paper extends [1] and reports improvements in separating handwritten text from machine-printed text (including typewritten text) using FCN-based models trained on datasets created with different data synthesis pipelines. Results show a significant increase of about 20% in the intrinsic evaluation on artificial test sets, and an 8% improvement in the extrinsic evaluation on a subsequent OCR task on real archival documents.
In 2019, the Digital Archives changed from being the National Archives of Norway’s own digital platform to Norway’s joint national digital platform for receiving, preserving, and publishing digitized/media-converted historical archives. The platform is free of charge for Norwegian archive institutions to use, whether they represent state, municipal, or private actors, small or large. The digital platform was first published in 1998 and marked its 25th anniversary in 2023.