We propose a framework for creating artificial ground-truth data for document images. The resulting data can be used to train machine-learning systems for page segmentation tasks. The framework focuses on images of historical documents. It generates document images with headlines of differing sizes, multiple column layouts, pictures, and decorative elements. To improve the resemblance to historical document images, a set of backgrounds is created manually by extracting background textures from real historical documents. The fading and curling typical of old manuscripts are also simulated. Experiments with a neural network, trained on data generated with the proposed framework and applied to real-world images, show promising results, with robust segmentation of text and non-text image areas.
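The core idea, generating a page layout and a pixel-wise label mask from the same description, might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the label set (`BACKGROUND`, `TEXT`, `NON_TEXT`), the function names `generate_layout` and `rasterize`, and all sizes and parameters are hypothetical, and background textures, fading, and curling are left out.

```python
import random

# Hypothetical class labels; the abstract does not specify the label set.
BACKGROUND, TEXT, NON_TEXT = 0, 1, 2


def generate_layout(width=400, height=600, n_columns=2,
                    margin=20, gutter=10, seed=0):
    """Return a list of (label, x0, y0, x1, y1) boxes for one synthetic page."""
    rng = random.Random(seed)
    boxes = []

    # Headline spanning the full text width, with a randomly chosen height.
    headline_h = rng.choice([20, 30, 40])
    boxes.append((TEXT, margin, margin, width - margin, margin + headline_h))

    # Column geometry below the headline.
    top = margin + headline_h + gutter
    usable = width - 2 * margin - (n_columns - 1) * gutter
    col_w = usable // n_columns

    # One randomly chosen column gets a picture at its top.
    pic_col = rng.randrange(n_columns)
    for c in range(n_columns):
        x0 = margin + c * (col_w + gutter)
        x1 = x0 + col_w
        y = top
        if c == pic_col:
            pic_h = rng.randint(60, 120)
            boxes.append((NON_TEXT, x0, y, x1, y + pic_h))
            y += pic_h + gutter
        # The rest of the column is body text.
        boxes.append((TEXT, x0, y, x1, height - margin))
    return boxes


def rasterize(boxes, width=400, height=600):
    """Turn the boxes into a ground-truth mask (list of rows of labels)."""
    mask = [[BACKGROUND] * width for _ in range(height)]
    for label, x0, y0, x1, y1 in boxes:
        for y in range(y0, y1):
            row = mask[y]
            for x in range(x0, x1):
                row[x] = label
    return mask


boxes = generate_layout(seed=0)
mask = rasterize(boxes)
```

Because the rendered page and the mask would both be derived from the same list of boxes, the ground truth is exact by construction. A fuller generator would draw text and pictures into the boxes and apply degradations to the image while leaving the mask unchanged (or warping both identically).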
Oliver Paetzel, Hauke Bluhm, "Creating artificial ground-truth data for document image page segmentation," in Proc. IS&T Archiving 2019, 2019, pp. 76-80, https://doi.org/10.2352/issn.2168-3204.2019.1.0.17