We propose a novel architecture based on the structure of autoencoders. The paper introduces CrossEncoders, an autoencoder architecture that uses cross-connections to link layers (both adjacent and non-adjacent) on the encoder and decoder sides of the network. The network incorporates both global and local information in the lower-dimensional code. We aim for an image compression algorithm with reduced training time and better generalization. The use of cross-connections makes training our network significantly faster. The performance of the proposed framework has been evaluated on real-world data from widely used benchmark datasets such as MNIST and CIFAR-10. Furthermore, we show that the proposed architecture achieves a high compression ratio and is more robust than previously proposed architectures and PCA. The results were validated using metrics such as PSNR-HVS and PSNR-HVS-M.
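A minimal sketch of the idea, not the authors' exact architecture: a decoder that receives cross-connections from encoder layers, so the reconstruction mixes local (shallow) and global (deep) information. All layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossEncoder(nn.Module):
    """Autoencoder whose decoder layers receive encoder activations
    via cross-connections (dimensions are assumptions for MNIST)."""
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Linear(256, 64), nn.ReLU())
        self.code = nn.Linear(64, code_dim)
        # Decoder layers take cross-connected encoder activations as extra input.
        self.dec1 = nn.Sequential(nn.Linear(code_dim + 64, 64), nn.ReLU())
        self.dec2 = nn.Sequential(nn.Linear(64 + 256, 256), nn.ReLU())
        self.out = nn.Linear(256, in_dim)

    def forward(self, x):
        h1 = self.enc1(x)                       # local features
        h2 = self.enc2(h1)                      # more global features
        z = self.code(h2)                       # low-dimensional code
        d1 = self.dec1(torch.cat([z, h2], 1))   # cross-connection from enc2
        d2 = self.dec2(torch.cat([d1, h1], 1))  # cross-connection from enc1
        return torch.sigmoid(self.out(d2)), z
```

The cross-connections give the decoder direct gradient paths into the shallow encoder layers, which is one plausible explanation for the faster training reported above.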
In VP9, a 64×64 superblock can be recursively decomposed all the way down to blocks of size 4×4. The encoder performs the encoding process for each possible partitioning, and the optimal one is selected by minimizing the rate-distortion cost. This scheme ensures encoding quality, but it also incurs high computational complexity and consumes substantial CPU resources. In this paper, to speed up the partition search without sacrificing quality, we propose a multi-level machine-learning-based early termination scheme. One weighted Support Vector Machine classifier is trained for each block size. The binary classifiers determine, for a given block, whether it is necessary to continue the search down to smaller blocks or to terminate early and take the current block size as the final one. Moreover, the classifiers are trained with varying error tolerance for different block sizes, i.e., a stricter error tolerance is adopted for larger block sizes than for smaller ones to control the drop in encoder performance. Extensive experimental results demonstrate that for HD and 4K videos, the proposed framework achieves a remarkable speed-up (20-25%) with less than a 0.03% performance drop measured in the Bjøntegaard delta bit rate (BDBR) compared with the current VP9 codebase.
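A minimal sketch of a per-block-size weighted SVM in the spirit described above; the feature set, penalty weights, and thresholds are assumptions, not the paper's values. The class weight skews the classifier against wrongly terminating at large block sizes, where a mistake costs more coding efficiency.

```python
import numpy as np
from sklearn.svm import SVC

def train_termination_classifier(X, y, block_size):
    # X: per-block features (e.g., pixel variance, RD cost so far, QP);
    # y: 1 = optimal partition stops at this size, 0 = splitting helps.
    # Larger blocks get a heavier penalty on false "terminate" decisions,
    # mirroring the stricter error tolerance described in the abstract.
    miss_penalty = {64: 8.0, 32: 4.0, 16: 2.0, 8: 1.0}[block_size]
    clf = SVC(kernel="rbf", class_weight={0: miss_penalty, 1: 1.0})
    clf.fit(X, y)
    return clf

def should_terminate(clf, features):
    # True -> keep the current block size; skip the recursive split search.
    return clf.predict(np.asarray(features).reshape(1, -1))[0] == 1
```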
There has been growing interest in using different approaches to improve the coding efficiency of modern video codecs in recent years as demand for web-based video consumption increases. In this paper, we propose a model-based approach that uses texture analysis/synthesis to reconstruct blocks in texture regions of a video to achieve potential coding gains, using the AV1 codec developed by the Alliance for Open Media (AOM). The proposed method uses convolutional neural networks to extract texture regions in a frame, which are then reconstructed using a global motion model. Our preliminary results show an increase in coding efficiency while maintaining satisfactory visual quality.
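A rough sketch of the synthesis step, under stated assumptions: the texture mask is assumed to come from the CNN stage, the global motion model is represented here as a homography, and the real pipeline runs inside the AV1 encoder rather than as a post-process.

```python
import cv2

def synthesize_texture(frame, reference, texture_mask, homography):
    """Replace texture-region pixels with a globally motion-compensated
    reference instead of coding them (texture_mask: 2-D boolean array)."""
    h, w = frame.shape[:2]
    # Warp the reference frame by the global motion model.
    warped = cv2.warpPerspective(reference, homography, (w, h))
    out = frame.copy()
    out[texture_mask] = warped[texture_mask]  # synthesize texture blocks
    return out                                # non-texture blocks coded normally
```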
Encoders of the AOM/AV1 codec treat an input video sequence as a succession of frames grouped into Golden-Frame (GF) groups. The coding structure of a GF group is fixed for a given GF group size. In the current AOM/AV1 encoder, video frames are coded using a hierarchical, multilayer coding structure within one GF group. It has been observed that the use of a multilayer coding structure may result in worse coding performance when the GF group exhibits consistent stillness across its frames. This paper proposes a new approach that adaptively designs the GF group coding structure through stillness detection. To this end, we develop an automatic stillness detection scheme using three metrics extracted from each GF group. It then differentiates still GF groups from non-still ones and applies different GF coding structures accordingly. Experimental results demonstrate a consistent coding gain using the new approach.
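The abstract does not name the three metrics, so the following is only an illustrative sketch: a GF group is flagged as still when three frame-difference statistics (all assumed here) fall below thresholds, after which the encoder would select a flat rather than multilayer coding structure.

```python
import numpy as np

def is_still_gf_group(frames, t_mean=1.0, t_max=8.0, t_frac=0.01):
    """frames: list of grayscale uint8 arrays for one GF group.
    Thresholds are placeholder assumptions, not tuned values."""
    diffs = [np.abs(frames[i + 1].astype(np.int16) - frames[i].astype(np.int16))
             for i in range(len(frames) - 1)]
    mean_diff = np.mean([d.mean() for d in diffs])          # assumed metric 1
    max_diff = np.max([d.max() for d in diffs])             # assumed metric 2
    moving_frac = np.mean([(d > 4).mean() for d in diffs])  # assumed metric 3
    return mean_diff < t_mean and max_diff < t_max and moving_frac < t_frac
```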
In this paper, we propose an active-learning-based approach to event recognition in personal photo collections to tackle the challenges posed by weakly labeled data and the presence of irrelevant pictures. Conventional approaches relying on supervised learning cannot identify the relevant samples in training albums, often leading to misclassification. In our work, we use active learning to choose the most relevant samples from a collection and train a classifier on them. We also investigate the importance of relevant images in the event recognition process, and show how performance degrades if all images from an album, including the irrelevant ones, are used. The experimental evaluation is carried out on a benchmark dataset composed of a large number of personal photo albums. We demonstrate that the proposed strategy yields encouraging scores in the presence of irrelevant images in personal photo collections, advancing recent leading works.
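A minimal sketch of a standard uncertainty-sampling active learning loop; the paper's exact selection criterion and classifier may differ, and the `oracle` callback (returning ground-truth relevance for queried samples) is an assumption of this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_lab, y_lab, X_pool, oracle, rounds=10, k=20):
    """Iteratively query labels for the k album images the current
    classifier is least certain about, then retrain."""
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X_lab, y_lab)
        proba = clf.predict_proba(X_pool)
        uncertainty = 1.0 - proba.max(axis=1)   # least-confident sampling
        pick = np.argsort(uncertainty)[-k:]     # k most uncertain samples
        X_lab = np.vstack([X_lab, X_pool[pick]])
        y_lab = np.concatenate([y_lab, oracle(X_pool[pick])])
        X_pool = np.delete(X_pool, pick, axis=0)
    return clf
```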
Historical Chinese character recognition suffers from labeling problems: not only are sufficient labeled training samples lacking, but the set of sample classes is also incomplete. The scenario is therefore "open set" recognition, where the labeling of sample classes is incomplete at training time and unknown classes can be submitted to the system during testing. This paper proposes a method for open set historical Chinese character recognition. In open set recognition, the features available in the training data cannot effectively characterize the various unknown classes. We assume that features which characterize unknown classes can be derived or learned from other, similar datasets. We utilize an auxiliary dataset combined with the open set training dataset to learn good features for representing historical Chinese characters. The auxiliary dataset is translated using Generative Adversarial Networks (GANs) so that the translated data are as close to the historical Chinese character dataset as possible. We then construct a neural network for feature extraction, trained with an alternating scheme on the translated auxiliary dataset and the incompletely labeled historical Chinese character dataset. Finally, features are extracted from a chosen layer of the trained network, and unknown samples are detected by statistically modelling the Euclidean distances between samples. Experimental results show that the proposed method is effective.
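A minimal sketch of the unknown-class detection step, with assumptions: features are taken from a chosen network layer, each known class is summarized by its centroid, and the rejection threshold is a free parameter rather than the paper's fitted statistical model.

```python
import numpy as np

def fit_centroids(features, labels):
    """Mean feature vector per known class."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify_open_set(feat, centroids, threshold):
    """Assign the nearest known class, or reject as unknown when even
    the closest class centroid is farther than `threshold`."""
    dists = {c: np.linalg.norm(feat - mu) for c, mu in centroids.items()}
    best = min(dists, key=dists.get)
    return best if dists[best] < threshold else "unknown"
```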
Convolutional neural networks (CNNs) have advanced the field of computer vision in the past years and enable groundbreaking, fast automatic results in various scenarios. However, how CNNs train when only scarce data are available has not yet been examined in detail. Transfer learning is a technique that helps overcome training data shortage by adapting trained models to a different but related target task. We investigate the transfer learning performance of pre-trained CNN models on variably sized training datasets for binary classification problems, which resemble the discrimination between relevant and irrelevant content within a restricted context. This often plays a role in data triage applications such as screening seized storage devices for evidence. Our evaluation shows that even with a small number of training examples, the models can achieve promising performance of up to 96% accuracy. We apply the transferred models to data triage by using their softmax outputs to rank unseen images according to their assigned probability of relevance. This provides a tremendous advantage in many application scenarios where large unordered datasets have to be screened for certain content.
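A minimal sketch of this setup; the backbone choice (ResNet-18) and frozen-layer policy are assumptions, not necessarily the models evaluated in the paper. The ranking step is exactly the softmax-based triage described above.

```python
import torch
import torch.nn as nn
from torchvision import models

# Adapt a pre-trained CNN to binary relevant/irrelevant classification.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                     # freeze pre-trained backbone
model.fc = nn.Linear(model.fc.in_features, 2)   # new binary head (trainable)

def rank_by_relevance(model, images):
    """images: (N, 3, 224, 224) tensor, already normalized.
    Returns indices sorted by P(relevant), highest first (triage order)."""
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(images), dim=1)[:, 1]
    return torch.argsort(probs, descending=True)
```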
Optical character recognition (OCR) automatically recognizes text in an image and converts it into machine codes such as ASCII or Unicode. Compared with the extensive OCR research on other languages, recognizing Arabic remains a challenging problem due to character connection and segmentation issues. In this work, we propose a deep-learning framework for recognizing Arabic characters based on multi-dimensional bidirectional long short-term memory (MD-BLSTM) with connectionist temporal classification (CTC). To train this framework, we generate a dataset of over one million Arabic text-line images containing Arabic digits and basic Arabic forms in both isolated and connected shapes. For comparison, we also measure the performance of other OCR software, namely Tesseract (originally developed by Hewlett-Packard and later by Google); both version 3 and version 4 are used. Results show that the deep-learning method outperforms the conventional methods in terms of recognition error rate, although the Tesseract 3.0 system was faster.
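A minimal sketch of a BLSTM + CTC recognizer; this uses a 1-D BLSTM as a stand-in for the paper's multi-dimensional variant, and all sizes (feature dimension, hidden units, alphabet size) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BLSTMCTC(nn.Module):
    """Emits a per-timestep distribution over characters plus a CTC blank;
    CTC loss aligns it to the unsegmented text-line label sequence."""
    def __init__(self, feat_dim=48, hidden=128, n_chars=120):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_chars + 1)  # +1 for the CTC blank

    def forward(self, x):                  # x: (batch, time, feat_dim)
        h, _ = self.rnn(x)
        return self.fc(h).log_softmax(-1)  # (batch, time, n_chars + 1)

# CTC loss handles the connected-script alignment without per-character
# segmentation; it expects (time, batch, classes) log-probabilities.
ctc = nn.CTCLoss(blank=0)
```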
Most sports competitions are still judged by humans; the judging process is not only demanding in skill and experience but also at risk of errors and unfairness. Advances in sensing and computing technologies have found successful applications assisting human judges with refereeing (e.g., the well-known Hawk-Eye system). Along this line of research, we propose a computer vision (CV)-based objective synchronization scoring system for synchronized diving, a relatively young Olympic sport. In synchronized diving, subjective judgement is often difficult due to the rapidity of human motion, the limited viewing angles, and the shortness of human memory, which motivates our development of an automatic and objective scoring system. Our CV-based scoring system consists of three components: (1) background estimation using color and optical flow cues, which effectively segments the silhouettes of both divers from the input video; (2) feature extraction using histograms of oriented gradients (HOG) and stick figures to obtain an abstract representation of each diver's posture that is invariant to body attributes (e.g., height and weight); (3) synchronization evaluation by training a feed-forward neural network using cross-validation. We have tested the designed system on 22 diving videos collected at the 2012 London Olympic Games. Our experimental results show that the CV-based approach can produce synchronization scores close to those given by human judges, with an MSE as low as 0.24.
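A minimal sketch of component (2) using scikit-image's HOG; the patch size and HOG parameters are assumptions. Resizing each silhouette crop to a fixed size contributes to the invariance to body attributes mentioned above.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def posture_descriptor(silhouette_crop):
    """HOG descriptor of one diver's silhouette crop; fixed resize makes
    the descriptor independent of diver height in the frame."""
    patch = resize(silhouette_crop, (128, 64))
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def pair_features(crop_a, crop_b):
    # Concatenated pair descriptor fed to the synchronization network.
    return np.concatenate([posture_descriptor(crop_a),
                           posture_descriptor(crop_b)])
```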