Many document image classification methods exist, but most are neither designed for devices with constrained computing resources, such as printers, nor focused on documents with highlighter pen marks. To enable printers to better discriminate highlighted documents, we designed
a set of features in CIE LCh(a*b*) space for use with a support vector machine (SVM). The features include two gamut-based features and six low-level color features. The gamut-based features are obtained by first identifying the highlight pixels and then computing the distance from those pixels to the boundary of the
printer gamut. The low-level color features are built on the color distribution of the image blocks. The best subset of the existing and new features is selected by sequential forward floating selection (SFFS).
Leave-one-out cross-validation on a dataset of 400 document images is used to evaluate the classification model. The cross-validation results indicate significant improvements over the baseline highlighted-document classification model.
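
The selection and evaluation procedure described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names `sffs` and `loocv_accuracy` are hypothetical, a nearest-centroid classifier stands in for the SVM to keep the example dependency-free, and the six-sample data set is a synthetic toy, not the 400-image data set. The sketch shows SFFS alternating a forward step (add the most helpful feature) with a floating step (drop features while that beats the best subset already seen at the smaller size), scored by leave-one-out accuracy.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Iterable, Sequence, Tuple

Subset = FrozenSet[Hashable]


def sffs(features: Iterable[Hashable],
         score: Callable[[Subset], float],
         max_size: int) -> Tuple[Subset, float]:
    """Sequential forward floating selection; returns (best subset, its score)."""
    pool = list(features)
    max_size = min(max_size, len(pool))
    if max_size < 1:
        raise ValueError("need at least one feature and max_size >= 1")
    cache: Dict[Subset, float] = {}

    def j(subset: Subset) -> float:
        # Memoize the criterion, since each evaluation may be a full cross-validation.
        if subset not in cache:
            cache[subset] = score(subset)
        return cache[subset]

    best: Dict[int, Tuple[float, Subset]] = {}  # subset size -> (score, subset)
    current: Subset = frozenset()
    while len(current) < max_size:
        # Forward step: add the single feature that maximizes the criterion.
        current = max((current | {f} for f in pool if f not in current), key=j)
        k = len(current)
        if k not in best or j(current) > best[k][0]:
            best[k] = (j(current), current)
        # Floating step: remove features while the reduced subset strictly
        # beats the best subset recorded so far at that smaller size.
        while len(current) > 2:
            reduced = max((current - {f} for f in current), key=j)
            if j(reduced) <= best[len(reduced)][0]:
                break
            current = reduced
            best[len(current)] = (j(current), current)
    # Highest score wins; ties go to the smaller subset.
    top_score, top_subset = max(best.values(), key=lambda t: (t[0], -len(t[1])))
    return top_subset, top_score


def loocv_accuracy(X: Sequence[Sequence[float]],
                   y: Sequence[Hashable],
                   subset: Subset) -> float:
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Assumption: nearest-centroid is only a stand-in for the SVM.
    """
    cols = sorted(subset)
    correct = 0
    for i in range(len(X)):
        # Build class centroids from every sample except the held-out one.
        sums: Dict[Hashable, list] = {}
        counts: Dict[Hashable, int] = {}
        for n, (row, label) in enumerate(zip(X, y)):
            if n == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(cols))
            for c, col in enumerate(cols):
                acc[c] += row[col]
            counts[label] = counts.get(label, 0) + 1

        def dist(label: Hashable) -> float:
            return sum((X[i][col] - sums[label][c] / counts[label]) ** 2
                       for c, col in enumerate(cols))

        correct += min(sums, key=dist) == y[i]
    return correct / len(X)


# Synthetic toy data: column 0 separates the classes, column 1 does not.
X_toy = [[0, 5], [1, 3], [0, 4], [10, 4], [11, 5], [10, 3]]
y_toy = [0, 0, 0, 1, 1, 1]
best_subset, best_score = sffs(range(2),
                               lambda s: loocv_accuracy(X_toy, y_toy, s),
                               max_size=2)
```

On the toy data, `best_subset` contains only column 0: adding column 1 does not raise the leave-one-out accuracy, and ties are broken in favor of the smaller subset. In a resource-constrained setting such as a printer, that tie-breaking rule is one plausible design choice, since fewer features mean less computation per page.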