We propose a novel algorithm for text/figure separation tailored to binary document images containing line drawings, block diagrams, charts, schemes, and other kinds of business graphics. Most approaches to this task rely either on the careful design of a visual descriptor that makes text and graphics regions easy to distinguish, or on supervised learning from a dataset of labeled text/figure regions. Such approaches often provide only moderate separation accuracy when applied to document images that contain a very diverse set of figure classes and lack a sufficiently representative labeled training dataset. In contrast, our method is well suited to a wide variety of figure classes and can operate in either semi-supervised or unsupervised mode. We achieve this by applying unsupervised learning algorithms to Docstrum descriptors extracted from regions of interest, followed by semi-supervised label propagation or unsupervised label inference. Another advantage of our method is its suitability for large-scale data processing, which is achieved through an efficient kernel-approximating feature mapping applied to the Docstrum descriptors and through two-level clustering: a fast mini-batch K-means algorithm is first applied to the large-scale data, and only the small number of resulting cluster centroids is subsequently processed by a more sophisticated clustering algorithm.
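The two-level clustering idea can be illustrated with a minimal sketch. It is not the authors' implementation: it uses toy 2-D points instead of Docstrum descriptors, omits the kernel-approximating feature mapping, and uses single-linkage agglomerative merging purely as a placeholder for the unspecified "more sophisticated" second-level algorithm. All function names, the farthest-point seeding, and the parameter defaults are assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def farthest_point_init(points, k):
    """Deterministic seeding (an assumption): repeatedly pick the point
    farthest from all centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    return centers

def mini_batch_kmeans(points, k, batch_size=32, n_iter=50, seed=0):
    """Level 1: cheap mini-batch K-means producing k fine-grained centroids."""
    rng = random.Random(seed)
    centers = farthest_point_init(points, k)
    counts = [0] * k
    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Cache assignments for the whole batch before moving any center.
        assigned = [nearest(p, centers) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            rate = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - rate) * c + rate * x
                          for c, x in zip(centers[j], p)]
    return centers

def single_linkage(centers, n_groups):
    """Level 2 placeholder: merge the closest groups of centroids until
    n_groups remain. Affordable because there are only a few centroids."""
    groups = [[i] for i in range(len(centers))]

    def gap(g, h):
        return min(sq_dist(centers[i], centers[j]) for i in g for j in h)

    while len(groups) > max(n_groups, 1):
        pairs = [(a, b) for a in range(len(groups))
                 for b in range(a + 1, len(groups))]
        a, b = min(pairs, key=lambda ab: gap(groups[ab[0]], groups[ab[1]]))
        merged = groups.pop(b)  # b > a, so index a stays valid
        groups[a].extend(merged)
    labels = [0] * len(centers)
    for group_id, group in enumerate(groups):
        for i in group:
            labels[i] = group_id
    return labels

def two_level_cluster(points, k_fine, n_groups, seed=0):
    """Cluster all points via their nearest fine centroid's group label."""
    centers = mini_batch_kmeans(points, k_fine, seed=seed)
    centroid_labels = single_linkage(centers, n_groups)
    return [centroid_labels[nearest(p, centers)] for p in points]
```

The design point is that the expensive second-level step sees only `k_fine` centroids, so its cost does not grow with the number of input regions.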
We present BGS (Big Graph Surfer), a scalable graph visualization tool that creates a hierarchical structure from the original graph and provides interactive navigation along the hierarchy by expanding or collapsing clusters when visualizing large-scale graphs. Spark, a distributed computing framework, provides the backend for clustering and visualization in BGS. This architecture makes it capable of visualizing a graph with more than 1 billion nodes or edges in real time after preprocessing. In addition, BGS provides a series of hierarchy and graph exploration methods, such as hierarchy view, hierarchy navigation, hierarchy search, graph view, graph navigation, graph search, and other useful interactions. These functionalities facilitate the exploration of very large-scale graphs. To evaluate the effectiveness of BGS, we apply it to several large-scale graph datasets and discuss its scalability, usability, and flexibility.
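To make the expand/collapse navigation concrete, the following is a small in-memory sketch of how a cluster hierarchy could determine what is drawn. It is an illustration only and does not reflect the actual BGS code or its Spark backend; the `Cluster` class and the functions `representative` and `visible_edges` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    """A node in the hierarchy; leaves are original graph vertices."""
    name: str
    children: list = field(default_factory=list)
    expanded: bool = False

    def is_leaf(self):
        return not self.children

def representative(root):
    """Map every leaf name to the visible node that currently stands for it."""
    mapping = {}

    def walk(node, shown_as):
        # The first leaf or collapsed cluster on the path from the root
        # is what gets drawn for everything beneath it.
        if shown_as is None and (node.is_leaf() or not node.expanded):
            shown_as = node.name
        if node.is_leaf():
            mapping[node.name] = shown_as
        for child in node.children:
            walk(child, shown_as)

    walk(root, None)
    return mapping

def visible_edges(root, leaf_edges):
    """Aggregate leaf-level edges onto the currently visible nodes."""
    rep = representative(root)
    edges = set()
    for u, v in leaf_edges:
        ru, rv = rep[u], rep[v]
        if ru != rv:  # edges inside a collapsed cluster are hidden
            edges.add(tuple(sorted((ru, rv))))
    return sorted(edges)
```

Toggling a cluster's `expanded` flag and recomputing `visible_edges` corresponds to one expand or collapse interaction; a system at the scale described would precompute such aggregates rather than recompute them on each click.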
Weather scientists are looking to better understand atmospheric conditions. We propose a new tool to detect the most significant associations between variables in multidimensional, multivariate, time-varying climate datasets. Specifically, we represent the correlation between variables and the uncertainty among different members within ensembles, and we apply several clustering methods. The climate dataset is collected at different time steps and locations. One of the most important research questions for weather scientists is the relationship between various variables at different time steps or at different spatial locations. In this paper, we present a set of techniques to evaluate the correlation and association between different variables within a time step and spatial location. In other words, we perform a static analysis on a single point in space-time and then extend that analysis along the temporal or spatial dimension(s), followed by an aggregation of the individual results to obtain an "overall" correlation. We created a tool that can be used to visualize the correlation and uncertainty not only between two time series of all ensembles but also between spatial locations. Mini-batch K-means clustering is applied to these datasets to identify the most substantial patterns within them. We study the Pearson correlation and integrate glyphs and color mapping into our design to show how the correlation values of a single variable, a pair of variables, or a triple of variables change. Statistical calculations are applied to derive an accurate interpretation of the time-varying correlations between members within all of the ensembles, as well as the uncertainty of the correlation values. The uncertainty visualizations provide insight into the effects of parameter perturbation, sensitivity to initial conditions, and inconsistencies in model outputs. To evaluate the tool, we apply this technique to a climatology dataset.
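As a rough illustration of correlation and its uncertainty across an ensemble, the sketch below computes the Pearson correlation between two variables separately for each ensemble member and then aggregates the results. Using the mean as the "overall" correlation and the population standard deviation across members as the uncertainty measure are assumptions made here for illustration; the abstract does not specify the statistics used, and the function names are hypothetical.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation of two equal-length series.
    Undefined (raises ZeroDivisionError) if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def ensemble_correlation(var_a, var_b):
    """var_a and var_b each hold one time series per ensemble member.

    Returns (overall, spread, per_member), where overall is the mean
    correlation and spread is its standard deviation across members.
    """
    per_member = [pearson(a, b) for a, b in zip(var_a, var_b)]
    overall = statistics.mean(per_member)
    spread = statistics.pstdev(per_member)
    return overall, spread, per_member
```

A large `spread` relative to `overall` would indicate that ensemble members disagree about the relationship, which is the kind of situation an uncertainty visualization is meant to expose.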
The characterization and abstraction of large multivariate time series data often pose challenges with respect to effectiveness or efficiency. Using the example of human motion capture data, challenges exist in creating compact solutions that still reflect semantics and kinematics in a meaningful way. We present a visual-interactive approach for the semi-supervised labeling of human motion capture data. Users can assign labels to the data, which can subsequently be used to represent the multivariate time series as sequences of motion classes. The approach combines multiple views that support the user in the visual-interactive labeling process. Visual guidance concepts further ease the labeling process by propagating the results of supportive algorithmic models. The abstraction of motion capture data to sequences of event intervals allows overview and detail-on-demand visualizations even for large and heterogeneous data collections. The guided selection of candidate data for the extension and improvement of the labeling closes the feedback loop of the semi-supervised workflow. We demonstrate the effectiveness and the efficiency of the approach in two usage scenarios, taking visual-interactive learning and human motion synthesis as examples.
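The abstraction of a labeled time series into event intervals can be sketched as a run-length grouping of per-frame labels. This is a minimal illustration under the assumption that each frame carries exactly one motion-class label; the function name and the (label, first frame, last frame) tuple layout are hypothetical.

```python
from itertools import groupby

def to_event_intervals(frame_labels):
    """Collapse per-frame labels into (label, first_frame, last_frame) intervals."""
    intervals = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)  # number of consecutive frames with this label
        intervals.append((label, start, start + length - 1))
        start += length
    return intervals
```

For example, the frame labels `["walk", "walk", "walk", "jump", "jump", "walk"]` collapse to three intervals, so an overview visualization needs to draw three segments instead of six frames.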
The exploration of text document collections is a complex and cumbersome task. Clustering techniques can help to group documents based on their content for the generation of overviews. However, the underlying clustering workflows, comprising preprocessing, feature selection, clustering algorithm selection, and parameterization, offer several degrees of freedom. Since no "best" clustering workflow exists, users have to evaluate clustering results based on the data and analysis tasks at hand. We present an interactive system for the creation and validation of text clustering workflows with the goal of exploring document collections. The system allows users to control every step of the text clustering workflow. First, users are supported in the feature selection process via feature ranking based on feature selection metrics and via linguistic filtering (e.g., part-of-speech filtering). Second, users can choose between different clustering methods and their parameterizations. Third, the clustering results can be explored based on the cluster content (documents and relevant feature terms) and cluster quality measures. Fourth, the results of different clusterings can be compared, and frequent document subsets in clusters can be identified. We validate the usefulness of the system with a usage scenario describing how users can explore document collections in a visual and interactive way.
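One simple way to compare clusterings and find documents that stay together is sketched below. It interprets "frequent document subsets" as groups of documents that share a cluster in every clustering under comparison, which is an assumption; the system described may use a different definition. The function name and the dictionary-based input format are hypothetical.

```python
from collections import defaultdict

def stable_subsets(clusterings, min_size=2):
    """Find groups of documents that share a cluster in every clustering.

    clusterings: list of dicts mapping document id -> cluster label,
    all defined over the same set of documents.
    """
    by_signature = defaultdict(list)
    for doc in clusterings[0]:
        # The tuple of labels across all clusterings identifies the group.
        signature = tuple(clustering[doc] for clustering in clusterings)
        by_signature[signature].append(doc)
    return sorted(sorted(group) for group in by_signature.values()
                  if len(group) >= min_size)
```

Documents that end up in the same returned group were never separated by any of the compared workflows, which suggests that their grouping does not depend on the particular parameter choices.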