Semantic segmentation, the task of assigning each pixel in an image to an object class, is an important problem in image understanding. In recent years, convolutional neural networks trained on public datasets have made it possible to segment objects and understand images. However, it remains challenging to segment objects with high accuracy using a simple, small network. In this work, we describe convolutional neural networks with dilated convolutions that segment people accurately, especially near object boundaries, with the help of a data augmentation technique. Additionally, we develop a smaller network that processes each webcam video frame faster without degrading segmentation performance. Our method outperforms other segmentation techniques both numerically and visually.
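The abstract does not give the exact architecture; as an illustration of the dilated-convolution idea, a minimal PyTorch sketch of a context module for binary person/background segmentation might look as follows (channel widths, layer count, and dilation rates are assumptions, not the authors' configuration):

```python
import torch.nn as nn

class DilatedContextBlock(nn.Module):
    """Hypothetical sketch: stacked 3x3 convolutions with increasing
    dilation rates enlarge the receptive field without downsampling,
    which helps keep person boundaries sharp."""
    def __init__(self, channels=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        layers = []
        for d in dilations:
            layers += [
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ]
        self.body = nn.Sequential(*layers)
        # one output channel for the binary person/background mask
        self.head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        return self.head(self.body(x))
```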
In recent years, there has been growing interest in approaches that improve the coding efficiency of modern video codecs as demand for web-based video consumption increases. In this paper, we propose a model-based approach that uses texture analysis/synthesis to reconstruct blocks in texture regions of a video to achieve potential coding gains using the AV1 codec developed by the Alliance for Open Media (AOM). The proposed method uses convolutional neural networks to extract texture regions in a frame, which are then reconstructed using a global motion model. Our preliminary results show an increase in coding efficiency while maintaining satisfactory visual quality.
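As a rough illustration of the reconstruction step (not the AV1 integration itself), the sketch below warps a reference frame with a global motion model and pastes it into pixels the CNN flagged as texture; the mask and homography inputs are assumed to come from earlier stages of the pipeline:

```python
import cv2
import numpy as np

def synthesize_texture_regions(ref_frame, cur_frame, texture_mask, H):
    """Hypothetical sketch: pixels flagged as 'texture' by the CNN are
    replaced by the reference frame warped with a global (homography)
    motion model instead of being coded conventionally.
    ref_frame, cur_frame: HxWx3 uint8 images; texture_mask: HxW binary;
    H: 3x3 global motion homography estimated elsewhere."""
    warped = cv2.warpPerspective(
        ref_frame, H, (cur_frame.shape[1], cur_frame.shape[0]))
    out = cur_frame.copy()
    out[texture_mask > 0] = warped[texture_mask > 0]
    return out
```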
Convolutional neural networks (CNNs) have improved the field of computer vision in the past years and allow groundbreaking new and fast automatic results in various scenarios. However, the training behaviour of CNNs when only scarce data are available has not yet been examined in detail. Transfer learning is a technique that helps overcome training data shortages by adapting trained models to a different but related target task. We investigate the transfer learning performance of pre-trained CNN models on variably sized training datasets for binary classification problems, which resemble the discrimination between relevant and irrelevant content within a restricted context. This often plays a role in data triage applications such as screening seized storage devices for evidence. The evaluation of our work shows that even with a small number of training examples, the models can achieve promising performance of up to 96% accuracy. We apply those transferred models to data triage by using the softmax outputs of the models to rank unseen images according to their assigned probability of relevance. This provides a tremendous advantage in many application scenarios where large unordered datasets have to be screened for certain content.
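A minimal sketch of the triage ranking idea is shown below: a transferred binary classifier scores unseen images by their softmax probability of being relevant, and images are sorted by that score. The backbone (ResNet-50) and the checkpoint path are placeholders; the abstract does not name the specific pre-trained models used:

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# Hypothetical transferred model: ImageNet backbone re-headed for
# two classes (relevant / irrelevant) and fine-tuned elsewhere.
model = models.resnet50(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 2)
model.load_state_dict(torch.load("triage_model.pt"))   # assumed checkpoint
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def relevance_score(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)
    return probs[0, 1].item()          # probability of the "relevant" class

# Rank a list of unseen image paths, most likely relevant first:
# ranked = sorted(image_paths, key=relevance_score, reverse=True)
```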
The analysis of complex structured data like video has been a long-standing challenge for computer vision algorithms. Innovative deep learning architectures like Convolutional Neural Networks (CNNs), however, are demonstrating remarkable performance in challenging image and video understanding tasks. In this work we propose an architecture for the automated detection of scored points during tennis matches. We explore two approaches based on CNNs for the analysis of video streams of broadcast tennis games. We first explore the two-stream approach, which extracts features related either to pixel intensity values from grayscale frames or to motion information encoded via optical flow. We then explore the use of higher-order 3D CNNs to simultaneously encode both spatial and temporal correlations. Furthermore, we explore the late fusion of the individual streams in order to extract and encode both structural and motion spatio-temporal dynamics. We validate the merits of the proposed scheme using a novel manually annotated dataset created from publicly available videos.
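The late-fusion step can be sketched very simply: the class posteriors of the appearance stream and of the motion stream are combined with a fusion weight. The averaging rule and the binary "point scored / no point" setup below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def late_fusion(spatial_logits, temporal_logits, w=0.5):
    """Hypothetical late fusion of the two streams: posteriors of the
    appearance (grayscale) stream and the motion (optical-flow) stream
    are combined with weight w."""
    p_spatial = F.softmax(spatial_logits, dim=1)
    p_temporal = F.softmax(temporal_logits, dim=1)
    return w * p_spatial + (1.0 - w) * p_temporal

# Example with dummy logits for a binary "point scored / no point" decision.
s, t = torch.randn(1, 2), torch.randn(1, 2)
print(late_fusion(s, t).argmax(dim=1))
```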
Spectral information obtained by hyperspectral sensors enables better characterization, identification and classification of the objects in a scene of interest. Unfortunately, several factors have to be addressed in the classification of hyperspectral data, including the acquisition process, the high dimensionality of spectral samples, and the limited availability of labeled data. Consequently, it is of great importance to design hyperspectral image classification schemes able to deal with the issues of the curse of dimensionality, and simultaneously produce accurate classification results, even from a limited amount of training data. To that end, we propose a novel machine learning technique that addresses the hyperspectral image classification problem by employing the state-of-the-art scheme of Convolutional Neural Networks (CNNs). The formal approach introduced in this work exploits the fact that the spatio-spectral information of an input scene can be encoded via CNNs and combined with multi-class classifiers. We apply the proposed method to a novel dataset acquired by a snapshot mosaic spectral camera and demonstrate the potential of the proposed approach for accurate classification.
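One common way to encode spatio-spectral information with a CNN, sketched below, is to classify the centre pixel of a small spatial patch while treating all spectral bands as input channels; patch size, band count, and layer widths here are illustrative assumptions rather than the authors' exact design:

```python
import torch.nn as nn

class SpatioSpectralCNN(nn.Module):
    """Minimal sketch: a small CNN classifies the centre pixel of a
    patch x patch window, with all spectral bands as input channels so
    spatial and spectral information are encoded jointly."""
    def __init__(self, bands=25, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(bands, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):              # x: (N, bands, patch, patch)
        f = self.features(x).flatten(1)
        return self.classifier(f)      # multi-class logits
```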
A visual system cannot process everything with full fidelity, nor, in a given moment, perform all possible visual tasks. Rather, it must lose some information, and prioritize some tasks over others. The human visual system has developed a number of strategies for dealing with its limited capacity. This paper reviews recent evidence for one strategy: encoding the visual input in terms of a rich set of local image statistics, where the local regions grow — and the representation becomes less precise — with distance from fixation. The explanatory power of this proposed encoding scheme has implications for another proposed strategy for dealing with limited capacity: that of selective attention, which gates visual processing so that the visual system momentarily processes some objects, features, or locations at the expense of others. A lossy peripheral encoding offers an alternative explanation for a number of phenomena used to study selective attention. Based on lessons learned from studying peripheral vision, this paper proposes a different characterization of capacity limits as limits on decision complexity. A general-purpose decision process may deal with such limits by "cutting corners" when the task becomes too complicated.
Recent advances in computational models in vision science have considerably furthered our understanding of human visual perception. At the same time, rapid advances in convolutional deep neural networks (DNNs) have resulted in computer vision models of object recognition which, for the first time, rival human object recognition. Furthermore, it has been suggested that DNNs may not only be successful models for computer vision, but may also be good computational models of the monkey and human visual systems. The advances in computational models in both vision science and computer vision pose two challenges in two different and independent domains: First, because the latest computational models have much higher predictive accuracy, and competing models may make similar predictions, we require more human data to be able to statistically distinguish between different models. Thus we would like methods to acquire trustworthy human behavioural data quickly and easily. Second, we need challenging experiments to ascertain whether models show similar input-output behaviour only near "ceiling" performance, or whether their performance degrades similarly to human performance: only then do we have strong evidence that models and human observers may be using similar features and processing strategies. In this paper we address both challenges.
In recent years, Convolutional Neural Networks (CNNs) have gained huge popularity among computer vision researchers. In this paper, we investigate how features learned by these networks in a supervised manner can be used to define a measure of self-similarity, an image feature that characterizes many images of natural scenes and patterns, and is also associated with images of artworks. Compared to a previously proposed method for measuring self-similarity based on oriented luminance gradients, our approach has two advantages. Firstly, we fully take color into account, an image feature which is crucial for vision. Secondly, by using higher-layer CNN features, we define a measure of self-similarity that relies more on image content than on basic local image features, such as luminance gradients.
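A hedged sketch of a CNN-based self-similarity score is given below: pooled higher-layer features of image sub-regions are compared with the pooled features of the whole image. The choice of VGG-19, the tap layer, the 2x2 grid, and cosine similarity are assumptions for illustration, not the measure proposed in the paper:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Higher-layer feature extractor (assumed cut-off point in VGG-19).
backbone = models.vgg19(
    weights=models.VGG19_Weights.IMAGENET1K_V1).features[:27].eval()

def self_similarity(img):              # img: (1, 3, H, W), ImageNet-normalised
    with torch.no_grad():
        fmap = backbone(img)           # (1, C, h, w) higher-layer features
    whole = F.adaptive_avg_pool2d(fmap, 1).flatten(1)
    h, w = fmap.shape[-2:]
    sims = []
    for i in range(2):                 # compare each quadrant to the whole image
        for j in range(2):
            sub = fmap[..., i*h//2:(i+1)*h//2, j*w//2:(j+1)*w//2]
            desc = F.adaptive_avg_pool2d(sub, 1).flatten(1)
            sims.append(F.cosine_similarity(desc, whole).item())
    return sum(sims) / len(sims)       # higher = more self-similar
```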
Finding an objective image quality metric that matches subjective quality has always been a challenging task. We propose a new full-reference image quality metric based on features extracted from Convolutional Neural Networks (CNNs). Using a pre-trained AlexNet model, we extract feature maps of the test and reference images at multiple layers, and compare their feature similarity at each layer. Such similarity scores are then pooled across layers to obtain an overall quality value. Experimental results on four state-of-the-art databases show that our metric is either on par with or outperforms 10 other state-of-the-art metrics, demonstrating that CNN features at multiple levels are superior to the handcrafted features used in most image quality metrics in capturing aspects that matter for perception.
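A minimal sketch of the idea follows: feature maps of the reference and test images are extracted at several AlexNet layers, compared with a similarity function, and pooled into a single quality score. The chosen tap layers, cosine similarity, and the averaging pool are assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn.functional as F
from torchvision import models

alexnet = models.alexnet(
    weights=models.AlexNet_Weights.IMAGENET1K_V1).features.eval()
tap_layers = {1, 4, 7, 9, 11}          # ReLU outputs of the five conv stages

def cnn_quality(ref, test):            # tensors of shape (1, 3, H, W)
    """Hypothetical full-reference metric: per-layer feature similarity
    between reference and test, averaged across tapped layers."""
    scores = []
    x, y = ref, test
    with torch.no_grad():
        for idx, layer in enumerate(alexnet):
            x, y = layer(x), layer(y)
            if idx in tap_layers:
                scores.append(F.cosine_similarity(
                    x.flatten(1), y.flatten(1)).item())
    return sum(scores) / len(scores)    # higher = closer to the reference
```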
According to the National Highway Traffic Safety Administration, one in ten fatal crashes and two in ten injury crashes were reported as distracted driver accidents in the United States during 2014. In an attempt to mitigate these alarming statistics, this paper explores using a dashboard camera along with computer vision and machine learning to automatically detect distracted drivers. We consider a dataset that incorporates drivers engaging in seven different distracting behaviors using the left and/or right hands. Traditional handcrafted features paired with a Support Vector Machine classifier are contrasted with deep Convolutional Neural Networks. The traditional features include a blend of Histogram of Oriented Gradients and Scale-Invariant Feature Transform descriptors used to create Bags of Words. The deep convolutional methods use transfer learning on AlexNet, VGG-16, and ResNet-152. The results yield 85% accuracy with ResNet and 82.5% accuracy with VGG-16, which outperformed AlexNet by almost 10%. Replacing the fully connected layers with a Support Vector Machine classifier did not improve the classification accuracy. The traditional features yielded much lower accuracy than the deep convolutional networks.
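The transfer-learning setup can be sketched as re-heading an ImageNet-pretrained backbone for the distraction classes and fine-tuning it on the dashboard-camera frames. The number of classes (safe driving plus seven distracting behaviors) and the backbone-freezing policy below are illustrative assumptions:

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 8   # assumed: safe driving + 7 distracting behaviors

# ImageNet-pretrained ResNet-152, re-headed for the driver-distraction task.
model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
for p in model.parameters():           # optionally freeze the convolutional backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)   # new trainable head
# Fine-tune with a standard cross-entropy loss on the driver dataset;
# the SVM variant mentioned in the abstract would instead feed the pooled
# features into an external SVM classifier.
```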