Document segmentation is the task of distinguishing the different parts of a document image according to their content. In this paper, the document image is segmented into text, pictures, and background. The proposed algorithm consists of four stages: background removal, block segmentation, feature extraction, and recognition. In background removal, local thresholds are used to extract the foreground of the image. In block segmentation, the run-length smoothing algorithm and connected component analysis are applied to divide the document image into a set of regions. Image features and geometric features are then extracted from each region. Finally, these features are fed into the classifier, a three-layer back-propagation neural network, whose output is the recognition result for the region: text or picture. Experiments show that most document images with simple backgrounds are segmented well by the proposed method. The proposed document segmentation system has several advantages: (1) it uses localized thresholds, based on color, to distinguish foreground from background; (2) it discriminates text from pictures by extracting well-chosen features; (3) it uses a trainable neural network as the classifier, whose structure can be adjusted flexibly; and (4) it segments precisely, because the classifier is trained on a large number of document images.
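The block segmentation stage can be illustrated with a minimal sketch of run-length smoothing, assuming the foreground has already been extracted as a binary mask (1 = foreground, 0 = background). The function names `rlsa_1d` and `rlsa`, the threshold parameters, and the choice to combine the horizontal and vertical passes with a logical AND are assumptions made for illustration; the abstract does not specify which variant of the algorithm or which thresholds the paper uses.

```python
import numpy as np


def rlsa_1d(row, threshold):
    """Fill background runs of length <= threshold that lie between two
    foreground pixels in a 1-D binary sequence (hypothetical helper)."""
    out = row.copy()
    fg = np.flatnonzero(row)  # indices of foreground pixels
    for a, b in zip(fg[:-1], fg[1:]):
        gap = b - a - 1  # number of background pixels between a and b
        if 0 < gap <= threshold:
            out[a + 1:b] = 1
    return out


def rlsa(binary, h_threshold, v_threshold):
    """Smooth a 2-D binary mask along rows and columns, then keep the
    pixels set in both results (an assumed, simplified variant)."""
    binary = (np.asarray(binary) > 0).astype(np.uint8)
    horiz = np.array([rlsa_1d(r, h_threshold) for r in binary])
    vert = np.array([rlsa_1d(c, v_threshold) for c in binary.T]).T
    return horiz & vert


# Example: characters on one text line separated by small gaps.
line = np.array([1, 0, 0, 1, 0, 0, 0, 0, 1], dtype=np.uint8)
smoothed = rlsa_1d(line, threshold=2)
```

With a threshold of 2, the two-pixel gap is filled and the four-pixel gap is left open, so nearby characters merge into one block while distant ones stay separate. The merged blocks could then be labeled by connected component analysis to obtain the regions passed on to feature extraction.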
Hsiao-Yu Han, "A Neural Network Based Color Document Segmentation," in Proc. IS&T Int'l Conf. on Digital Printing Technologies (NIP19), 2003, pp. 859-864, https://doi.org/10.2352/ISSN.2169-4451.2003.19.1.art00098_2