Compared to low-level saliency, higher-level information better predicts human eye movements in static images. In the current study, we tested how both types of information predict eye movements while observers view videos. We generated multiple eye movement prediction maps based on low-level saliency features, as well as on higher-level information that requires cognition and therefore cannot be interpreted through bottom-up processes alone. We investigated eye movement patterns to both static and dynamic features that contained either low- or higher-level information. We found that higher-level object-based and multi-frame motion information predicts human eye movement patterns better than static saliency and two-frame motion information, and that higher-level static and dynamic features provide equally good predictions. The results suggest that object-based processes and temporal integration across multiple video frames are essential in guiding human eye movements during video viewing.
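One common way such prediction maps are scored against recorded gaze (the abstract does not state the exact evaluation used) is to measure how well the map values discriminate fixated from other locations, e.g., with an ROC area. Below is a minimal sketch of that idea on synthetic data; the function name `prediction_auc` and the negative-sampling scheme are illustrative assumptions, not the study's procedure.

```python
# Hedged sketch: score a prediction map by the ROC area (AUC) between its
# values at fixated pixels and at randomly sampled pixels. Synthetic data only.
import numpy as np
from sklearn.metrics import roc_auc_score

def prediction_auc(pred_map, fixation_mask, n_negatives=10000, seed=0):
    """AUC of a prediction map: fixated pixels vs. a random pixel sample."""
    rng = np.random.default_rng(seed)
    pos = pred_map[fixation_mask > 0]                       # map values at fixations
    neg_idx = rng.integers(0, pred_map.size, n_negatives)   # random pixel sample (negatives)
    neg = pred_map.ravel()[neg_idx]
    y = np.concatenate([np.ones(pos.size), np.zeros(neg.size)])
    return roc_auc_score(y, np.concatenate([pos, neg]))

# Synthetic example: a map that peaks where the fixations fall scores close to 1.
h, w = 120, 160
yy, xx = np.mgrid[:h, :w]
pred = np.exp(-((yy - 60) ** 2 + (xx - 80) ** 2) / (2 * 20 ** 2))  # central blob
fix = np.zeros((h, w)); fix[55:65, 75:85] = 1                      # fixated region
print("AUC:", round(prediction_auc(pred, fix), 3))
```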
Human behavior often consists of a series of distinct activities, each characterized by a unique pattern of interaction with the visual environment. This is true even in a restricted domain, such as a pilot flying an airplane; in this case, activities with distinct visual signatures might include communicating, navigating, and monitoring. We propose a novel analysis method for gaze-tracking data to perform blind discovery of these hypothetical activities. The method is in some respects analogous to recurrence analysis, which has previously been applied to eye movement data. In the present case, however, we compare not individual fixations but groups of fixations aggregated over a fixed time interval (t). We assume that the environment has been divided into a finite set of discrete areas-of-interest (AOIs). For a given time interval, we compute the proportion of time spent fixating each AOI, resulting in an N-dimensional vector, where N is the number of AOIs. These proportions can be converted to integer counts by multiplying by t divided by the average fixation duration, a parameter that we fix at 283 milliseconds. We compare different intervals by computing the chi-squared statistic. The p-value associated with the statistic is the likelihood of observing the data under the hypothesis that the two intervals were generated by a single process with a single set of probabilities governing the fixation of each AOI. We cluster the intervals, first by merging adjacent intervals that are sufficiently similar, optionally shifting the boundary between non-merged intervals to maximize the difference. Then we compare and cluster non-adjacent intervals. The method is evaluated using synthetic data generated by a hand-crafted set of activities. While the method generally finds more activities than were put into the simulation, we have obtained agreement as high as 80% between the inferred activity labels and ground truth.
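A minimal sketch of the interval-comparison step described above, assuming the counts from two intervals are arranged as a 2×N contingency table and tested with a standard chi-squared test of homogeneity. The function names and the 10 s interval length are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: convert per-AOI dwell proportions to counts and compare two
# intervals with a chi-squared test, as outlined in the abstract.
import numpy as np
from scipy.stats import chi2_contingency

AVG_FIX_DUR_S = 0.283   # average fixation duration (283 ms, as in the abstract)
INTERVAL_S = 10.0       # assumed aggregation interval t

def proportions_to_counts(props, t=INTERVAL_S):
    """Convert per-AOI dwell-time proportions into integer fixation counts."""
    return np.rint(np.asarray(props) * t / AVG_FIX_DUR_S).astype(int)

def compare_intervals(props_a, props_b, t=INTERVAL_S):
    """Chi-squared test of the hypothesis that both intervals were generated by
    one process with a single set of per-AOI fixation probabilities."""
    counts = np.vstack([proportions_to_counts(props_a, t),
                        proportions_to_counts(props_b, t)])
    counts = counts[:, counts.sum(axis=0) > 0]   # drop AOIs never fixated in either interval
    stat, p_value, _, _ = chi2_contingency(counts)
    return stat, p_value

# Example: two 10 s intervals over N = 4 AOIs.
stat, p = compare_intervals([0.5, 0.3, 0.2, 0.0], [0.1, 0.2, 0.3, 0.4])
print(f"chi2 = {stat:.2f}, p = {p:.4f}")  # small p => likely different activities
```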
Human visual acuity strongly depends on environmental conditions. One of the most important physical parameters affecting it is the pupil diameter, which follows changes in the surrounding illumination through adaptation. Direct measurement of its influence on visual performance would therefore require either pharmacological intervention or inconvenient artificial apertures placed in front of the subjects' eyes to examine different pupil sizes, which is why it has not yet been studied in detail. In order to analyze this effect directly, without any external intervention, we performed simulations with our complex neuro-physiological vision model. It treats subjects as ideal observers limited by optical and neural filtering, as well as neural noise, and represents character recognition by template matching. Using the model, we reconstructed the monocular visual acuity of real subjects with optical filtering calculated from the measured wavefront aberration of their eyes. According to our simulations, a 1 mm change in pupil diameter causes a 0.05 logMAR change in visual acuity on average. Our result is in good agreement with former clinical experience derived indirectly from measurements that independently analyzed the effect of background illumination on pupil size and on visual quality.
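A toy, schematic illustration of the template-matching stage of such an ideal-observer model: a stimulus degraded by blur (standing in for optical filtering) and additive noise (standing in for neural noise) is matched against clean templates, and the best-matching template is reported. The random "optotypes", blur width, and noise level below are arbitrary placeholders, not the paper's calibrated model.

```python
# Hedged sketch of ideal-observer letter recognition by template matching.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

def make_templates(n_letters=10, size=32):
    """Toy binary 'optotype' templates standing in for real letter charts."""
    return rng.integers(0, 2, size=(n_letters, size, size)).astype(float)

def recognize(stimulus, templates):
    """Matched-filter decision: report the template with the highest
    normalized correlation with the degraded stimulus."""
    s = stimulus - stimulus.mean()
    scores = []
    for t in templates:
        tz = t - t.mean()
        scores.append(np.sum(s * tz) / np.linalg.norm(tz))
    return int(np.argmax(scores))

templates = make_templates()
true_letter = 3
degraded = ndimage.gaussian_filter(templates[true_letter], sigma=1.5)  # "optical" blur
degraded += rng.normal(scale=0.2, size=degraded.shape)                 # "neural" noise
print("presented:", true_letter, " reported:", recognize(degraded, templates))
```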
As a biologically inspired hypothesis, we consider two stereo information channels. One is the traditional channel that works on the basis of the horizontal disparity between the left and right projections of single points in the 3D scene; this channel carries information about the absolute depth of the point. The second channel works on the basis of the projections of pairs of points in the 3D scene and carries information about the relative depth of the points; equivalently, for a given azimuth disparity of the points, the channel carries information about the ratio of the orientations of the left and right projections of the line segment between the pair of points.
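For illustration only, the standard small-angle stereo geometry behind these two signals can be written as follows; the symbols (interocular separation I, viewing distance Z, depth offset ΔZ, vertical image extent Δy, horizontal image extents Δx_L, Δx_R, and orientations θ_L, θ_R measured from vertical) are textbook conventions assumed here, not notation from this work.

```latex
% Channel 1: horizontal disparity of a single point at distance Z, offset in
% depth by \Delta Z from fixation, for interocular separation I
% (small-angle approximation):
\delta \;\approx\; \frac{I\,\Delta Z}{Z^{2}}

% Channel 2: for the line segment joining a pair of points, with (to first
% order) equal vertical image extent \Delta y in both eyes and horizontal
% extents \Delta x_L, \Delta x_R, the orientations measured from vertical obey
\frac{\tan\theta_L}{\tan\theta_R}
  \;=\; \frac{\Delta x_L/\Delta y}{\Delta x_R/\Delta y}
  \;=\; \frac{\Delta x_L}{\Delta x_R},
% so the orientation ratio carries the difference of the endpoints'
% horizontal disparities, i.e., their relative depth.
```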
Quantization of images containing low-texture regions, such as sky, water, or skin, can produce banding artifacts. As the bit depth of each color channel is decreased, smooth image gradients are transformed into perceivable, wide, discrete bands. Commonly used quality metrics cannot reliably measure the visibility of such artifacts. In this paper we introduce a visual model for predicting the visibility of both luminance and chrominance banding artifacts in image gradients spanning between two arbitrary points in a color space. The model analyzes the error introduced by quantization in the Fourier space and employs a purpose-built spatio-chromatic contrast sensitivity function to predict its visibility. The output of the model is a detection probability, which can then be used to compute the minimum bit depth for which banding artifacts are just noticeable. We demonstrate that the model can accurately predict the results of our psychophysical experiments.
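A simplified, luminance-only sketch of the pipeline described above: quantize a smooth gradient, analyze the quantization error in the Fourier domain, weight it by a contrast sensitivity function, and map the pooled response to a detection probability. The CSF and the psychometric parameters below are crude placeholders, not the paper's purpose-built spatio-chromatic model.

```python
# Hedged sketch: Fourier-domain analysis of banding (quantization) error,
# weighted by a toy CSF and mapped to a detection probability.
import numpy as np

def quantize(signal, bits):
    levels = 2 ** bits - 1
    return np.round(signal * levels) / levels

def csf_placeholder(freq_cpd):
    """Toy band-pass luminance CSF (peak near a few cycles/degree); stands in
    for the purpose-built spatio-chromatic CSF of the paper."""
    return 200.0 * freq_cpd * np.exp(-0.8 * freq_cpd)

def banding_detection_probability(bits, n=1024, width_deg=10.0,
                                  threshold=0.01, beta=3.5):
    x = np.linspace(0.0, 1.0, n)                  # smooth gradient between two points
    err = quantize(x, bits) - x                   # quantization (banding) error
    spectrum = np.abs(np.fft.rfft(err)) / n       # error amplitude spectrum
    freqs = np.fft.rfftfreq(n, d=width_deg / n)   # cycles per degree
    response = np.sum(spectrum * csf_placeholder(freqs))
    # Weibull-style psychometric mapping from pooled response to P(detection).
    return 1.0 - np.exp(-(response / threshold) ** beta)

for bits in (6, 8, 10):
    print(bits, "bits ->", round(banding_detection_probability(bits), 3))
```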
In this paper we introduce two new no-reference metrics and compare their performance to state-of-the-art metrics on six publicly available datasets covering a large variety of distortions and characteristics. Our two metrics, based on neural networks, combine the following features: histogram of oriented gradients, edge detection, fast Fourier transform, CPBD, blur and contrast measurements, temporal information, freeze detection, BRISQUE, and Video BLIINDS. They outperform Video BLIINDS and BRISQUE on all six datasets used in this study, including one made up of natural videos that have not been artificially distorted. Our metrics thus generalize well, achieving high performance across the six datasets.
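A minimal sketch of the general recipe behind such metrics: per-video hand-crafted features are concatenated and a small neural network regresses a quality score. The reduced feature set below (a crude sharpness term, RMS contrast, temporal information) and the random training data are placeholders for the full feature list and subjective scores used in the paper.

```python
# Hedged sketch: concatenated hand-crafted features -> neural-network regressor.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def video_features(frames):
    """frames: (T, H, W) grayscale array -> 1-D feature vector."""
    blur = np.mean([np.var(np.gradient(f)[0]) for f in frames])   # crude sharpness proxy
    contrast = np.mean([f.std() for f in frames])                 # RMS contrast
    ti = np.mean(np.abs(np.diff(frames, axis=0)))                 # temporal information
    return np.array([blur, contrast, ti])

# Placeholder training set: one feature vector per video, y = subjective MOS.
X = np.array([video_features(np.random.rand(30, 64, 64)) for _ in range(40)])
y = np.random.uniform(1, 5, size=40)   # placeholder MOS values

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000,
                                   random_state=0))
model.fit(X, y)
print(model.predict(X[:3]))            # predicted quality scores
```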
Objective quality assessment of compressed images is very useful in many applications. In this paper we present an objective quality metric that is better tuned to evaluate the quality of images distorted by compression artifacts. A deep convolutional neural network is used to extract features from a reference image and its distorted version. The selected features have both spatial and spectral characteristics, providing substantial information on perceived quality. These features are extracted from numerous randomly selected patches of each image, and overall image quality is computed as a weighted sum of patch scores, where the weights are learned during training. The model parameters are initialized based on a previous work and further trained using content from a recent JPEG XL call for proposals. The proposed model is then analyzed on both the above JPEG XL test set and images distorted by compression algorithms in the TID2013 database. Test results indicate that the new model outperforms the initial model, as well as other state-of-the-art objective quality metrics.
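A hedged sketch of the patch-based scheme described above: deep features are extracted from matching reference/distorted patches, each patch yields a quality score and a learned weight, and the image score is the weighted sum of patch scores. The backbone, patch size, feature combination, and head dimensions are illustrative choices, not the authors' exact architecture.

```python
# Hedged sketch: CNN patch features + learned patch-weighted pooling.
import torch
import torch.nn as nn
import torchvision.models as models

class PatchWeightedMetric(nn.Module):
    def __init__(self):
        super().__init__()
        # In practice a pretrained backbone would be used; weights=None keeps
        # this sketch self-contained (no download).
        self.features = models.vgg16(weights=None).features[:16]   # conv features (256 ch)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.score_head = nn.Linear(2 * 256, 1)    # per-patch quality score
        self.weight_head = nn.Linear(2 * 256, 1)   # per-patch importance weight

    def forward(self, ref_patches, dist_patches):
        # ref_patches, dist_patches: (P, 3, 64, 64) patch pairs from one image
        f_ref = self.pool(self.features(ref_patches)).flatten(1)
        f_dist = self.pool(self.features(dist_patches)).flatten(1)
        f = torch.cat([f_ref - f_dist, f_dist], dim=1)
        scores = self.score_head(f).squeeze(1)
        weights = torch.softmax(self.weight_head(f).squeeze(1), dim=0)
        return (weights * scores).sum()            # weighted sum of patch scores

model = PatchWeightedMetric()
ref = torch.rand(8, 3, 64, 64)      # 8 randomly selected patch pairs
dist = torch.rand(8, 3, 64, 64)
print(model(ref, dist).item())      # overall predicted quality
```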
Display resolutions such as 720p, Full HD, 4K, and 8K have been increasing steadily in recent years. However, many video streaming providers currently stream videos at a maximum resolution of 4K/UHD-1. Considering that typical viewers watch their videos in ordinary living rooms, where viewing distances are quite large, the question arises whether the additional resolution is even recognizable. In this paper we analyze the perceptibility of UHD in comparison with lower resolutions. As a first step, we conducted a subjective video test that focuses on short uncompressed video sequences and compares two different testing methods for pairwise discrimination between two representations of the same source video at different resolutions. We selected an extended stripe method and a temporal switching method, and found that temporal switching is more suitable for recognizing UHD video content. Furthermore, we developed features that can be used in a machine learning system to predict whether there is a benefit in showing a given video in UHD. Evaluating different models based on these features for predicting perceivable differences shows good performance on the available test data. Our implemented system can be used to verify UHD source video material or to optimize streaming applications.
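One plausible feature of the kind such a system might use (not necessarily one of the authors' features) is the high-frequency detail lost when a UHD frame is downscaled to HD and upscaled back; a small classifier can then be trained on per-video features against the subjective pair-comparison outcomes. Everything below, including the feature name, training data, and classifier choice, is a placeholder sketch.

```python
# Hedged sketch: a detail-loss feature plus a binary classifier that predicts
# whether a perceivable UHD benefit is expected for a given video.
import numpy as np
from scipy import ndimage
from sklearn.linear_model import LogisticRegression

def detail_loss_feature(frame_uhd):
    """Mean squared detail removed by an HD round trip (downscale x0.5, upscale x2)."""
    hd = ndimage.zoom(frame_uhd, 0.5, order=1)
    back = ndimage.zoom(hd, 2.0, order=1)[:frame_uhd.shape[0], :frame_uhd.shape[1]]
    return float(np.mean((frame_uhd - back) ** 2))

# Placeholder training data: one feature per video, label 1 = "UHD benefit perceivable".
X = np.array([[detail_loss_feature(np.random.rand(216, 384))] for _ in range(20)])
y = np.tile([0, 1], 10)                  # placeholder subjective-test outcomes

clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:3]))                # predicted benefit / no benefit
```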
The appropriate characterization of the test material used for subjective evaluation tests and for benchmarking image and video processing algorithms and quality metrics can be crucial in order to perform comparative studies that provide useful insights. This paper focuses on the characterization of 360-degree images. We discuss why it is important to take into account the geometry of the signal and the interactive nature of 360-degree content navigation for a perceptual characterization of these signals. In particular, we show that the computation of classical indicators of spatial complexity, commonly used for 2D images, might lead to different conclusions depending on the geometrical domain used to represent the 360-degree signal. Finally, new complexity measures based on the analysis of visual attention and content exploration patterns are proposed.
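A minimal sketch of the geometry point above: the classical spatial information (SI) indicator computed directly on the equirectangular image generally differs from a version that weights each pixel by its solid angle on the sphere (the cosine of its latitude). The function names and the random placeholder image are illustrative, not the paper's proposed measures.

```python
# Hedged sketch: spatial information (SI) on an equirectangular image,
# with and without solid-angle (cosine-latitude) weighting.
import numpy as np
from scipy import ndimage

def spatial_information(img, weights=None):
    """SI = (weighted) standard deviation of the Sobel gradient magnitude."""
    gx = ndimage.sobel(img, axis=1)
    gy = ndimage.sobel(img, axis=0)
    mag = np.hypot(gx, gy)
    if weights is None:
        return mag.std()
    mean = np.average(mag, weights=weights)
    return np.sqrt(np.average((mag - mean) ** 2, weights=weights))

def sphere_weights(h, w):
    """Per-pixel solid-angle weights for an equirectangular projection."""
    lat = (np.arange(h) + 0.5) / h * np.pi - np.pi / 2   # latitude, -pi/2 .. pi/2
    return np.tile(np.cos(lat)[:, None], (1, w))

img = np.random.rand(256, 512)      # placeholder equirectangular luminance image
w = sphere_weights(*img.shape)
print("SI (planar):          ", spatial_information(img))
print("SI (sphere-weighted): ", spatial_information(img, weights=w))
```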