We propose a novel algorithm for text/figure separation tailored to binary document images containing line drawings, block diagrams, charts, schemes and other kinds of business graphics. Most approaches to this task rely either on a carefully designed visual descriptor that makes text and graphics regions easy to distinguish, or on supervised learning from a dataset of labeled text/figure regions. Such approaches often provide only moderate separation accuracy when applied to document images that contain a very diverse set of figure classes and lack a sufficiently representative labeled training dataset. In contrast, our method is well suited to a wide variety of figure classes and can operate in either semi-supervised or unsupervised mode. We achieve this by applying unsupervised learning algorithms to Docstrum descriptors extracted from regions of interest, followed by semi-supervised label propagation or unsupervised label inference. Another advantage of our method is its suitability for large-scale data processing, which is achieved through an efficient kernel-approximating feature mapping applied to the Docstrum descriptors and through two-level clustering: a fast mini-batch K-means algorithm is first applied to the large-scale data, and only the small number of resulting cluster centroids is subsequently processed by a more sophisticated clustering algorithm.
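The two-level clustering idea can be illustrated with a minimal sketch. It is not the authors' implementation: toy 2-D points stand in for Docstrum descriptors, the kernel-approximating feature map is omitted, single-linkage agglomerative clustering is an assumed stand-in for the "more sophisticated" second-level algorithm, and a majority vote over a few labeled seeds is a simplified substitute for label propagation. All function names and parameters (`mini_batch_kmeans`, `single_linkage`, `two_level_cluster`, `name_clusters`, `k_fine`, `n_groups`) are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def mini_batch_kmeans(data, k, batch_size, n_iter, rng):
    """Level 1: mini-batch K-means with per-centroid learning rates."""
    step = len(data) // k
    # Assumption: seed centroids with evenly strided data points.
    centroids = [list(data[i * step]) for i in range(k)]
    counts = [0] * k
    for _ in range(n_iter):
        batch = rng.sample(data, batch_size)
        # Assign the whole batch first, then update the centroids.
        assigned = [nearest(p, centroids) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # step size shrinks as a centroid sees more points
            centroids[j] = [(1 - eta) * c + eta * x for c, x in zip(centroids[j], p)]
    return centroids

def single_linkage(points, n_clusters):
    """Level 2: agglomerative single-linkage clustering of the few centroids."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(sq_dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    labels = [0] * len(points)
    for cid, members in enumerate(clusters):
        for i in members:
            labels[i] = cid
    return labels

def two_level_cluster(data, k_fine=8, n_groups=2, seed=0):
    """Cluster all points cheaply, then group only the resulting centroids."""
    rng = random.Random(seed)
    centroids = mini_batch_kmeans(data, k_fine, batch_size=32, n_iter=50, rng=rng)
    centroid_labels = single_linkage(centroids, n_groups)
    return [centroid_labels[nearest(p, centroids)] for p in data]

def name_clusters(cluster_ids, seeds):
    """Semi-supervised step (simplified): majority vote of labeled seeds per cluster."""
    votes = {}
    for idx, label in seeds.items():
        per_cluster = votes.setdefault(cluster_ids[idx], {})
        per_cluster[label] = per_cluster.get(label, 0) + 1
    names = {cid: max(v, key=v.get) for cid, v in votes.items()}
    return [names.get(cid) for cid in cluster_ids]

# Toy data: two well-separated groups standing in for text and figure regions.
data_rng = random.Random(1)

def blob(cx, cy, n=200):
    return [(cx + data_rng.uniform(-1, 1), cy + data_rng.uniform(-1, 1)) for _ in range(n)]

data = blob(0, 0) + blob(10, 10)
cluster_ids = two_level_cluster(data)
labels = name_clusters(cluster_ids, {0: "text", 399: "figure"})
```

The design point is that the quadratic-cost second-level step only ever sees `k_fine` centroids, while the full dataset is touched only by the cheap mini-batch updates and a final nearest-centroid assignment.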
Current wearable camera and computer technology opens the way for preserving every printed, computer-mediated and spoken word that an individual has ever seen or heard. Text images acquired autonomously at one frame per second by a 20-megapixel miniature camera, together with recorded speech, both carrying GPS tags, can be uploaded and stored permanently on available mobile or desktop devices. After culling redundant images and mosaicking fragments, the text can be transcribed, tagged, indexed and summarized. A combination of already developed methods of information retrieval, web science and cognitive computing will enable selective retrieval of the accumulated information. New issues are engendered by the potential advent of microcosms of personal information at a scale of about 1:1,000,000 of the World Wide Web.