Document image classification has been studied extensively, but most existing methods are neither designed for resource-constrained devices such as printers nor focused on documents with highlighter pen marks. To enable printers to better discriminate highlighted documents, we designed a set of features in CIE Lch(a* b*) space for use with a support vector machine (SVM). The set comprises two gamut-based features and six low-level color features. The gamut-based features are obtained by first identifying the highlight pixels and then computing the distance from those pixels to the boundary of the printer gamut. The low-level color features are built on the color distribution of the image blocks. The best subset of the existing and new features is chosen by sequential forward floating selection (SFFS). Leave-one-out cross-validation on a dataset of 400 document images is used to evaluate the effectiveness of the classification model. The cross-validation results indicate significant improvements over the baseline highlighted document classification model.
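The two-step idea behind the gamut-based features (find highlight pixels, then measure how far they sit from the gamut boundary) can be illustrated with a minimal sketch. The conversion from CIE L*a*b* to lightness, chroma, and hue is standard; everything else here is an assumption made for illustration and is not taken from the abstract: the function names, the lightness and chroma thresholds in `is_highlight`, the single `max_chroma` value standing in for a real printer gamut boundary, and the two summary statistics returned by `gamut_features`.

```python
import math

def lab_to_lch(L, a, b):
    """Convert CIE L*a*b* to cylindrical L*, C*, h (hue in degrees, 0..360)."""
    C = math.hypot(a, b)                         # chroma = sqrt(a^2 + b^2)
    h = math.degrees(math.atan2(b, a)) % 360.0   # hue angle
    return L, C, h

def is_highlight(L, C, min_lightness=70.0, min_chroma=40.0):
    """Hypothetical rule: highlighter ink is assumed bright and saturated."""
    return L >= min_lightness and C >= min_chroma

def gamut_features(pixels_lab, max_chroma=60.0):
    """Return (fraction of highlight pixels outside gamut, mean chroma margin).

    max_chroma is a placeholder for the printer gamut boundary; a real
    boundary would depend on lightness and hue, not a single constant.
    """
    margins = []
    for L, a, b in pixels_lab:
        L, C, _h = lab_to_lch(L, a, b)
        if is_highlight(L, C):
            # Signed distance along the chroma axis; negative means
            # the pixel lies outside the assumed gamut.
            margins.append(max_chroma - C)
    if not margins:
        return 0.0, 0.0
    frac_outside = sum(m < 0 for m in margins) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return frac_outside, mean_margin
```

In a pipeline like the one described, values of this kind would be concatenated with the low-level color features and passed to the SVM; the feature definitions actually used by the authors are not given in the abstract.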
With the increasing demand for scanning text documents and old books, a scanner that automatically detects the orientation of each scanned page would be greatly beneficial. This paper proposes a fast orientation detection method based on a support vector machine (SVM), using features computed for each connected component on the scanned page. Results show that the algorithm achieves an accuracy of 99.2% in orientation detection and 98.2% in script detection for pages scanned at 200 dpi.
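As a rough illustration of per-component feature extraction, the sketch below labels 8-connected foreground regions in a binarized page and computes a few shape descriptors for each. The abstract does not say which features are used, so the descriptors here (bounding-box width and height, aspect ratio, fill ratio) and the function names are hypothetical stand-ins; the page is assumed to be a non-empty list of rows holding 0 (background) or 1 (ink).

```python
from collections import deque

def connected_components(img):
    """Return a list of components, each a list of (row, col) ink pixels."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components

def component_features(pixels):
    """Hypothetical shape descriptors for one connected component."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {
        "width": width,
        "height": height,
        "aspect": width / height,
        "fill": len(pixels) / (width * height),  # ink share of bounding box
    }
```

Feature vectors of this kind, one per component, are what a classifier such as an SVM would consume; how the per-component decisions are combined into a page-level orientation is not described in the abstract and is left out here.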