A visual system cannot process everything with full fidelity, nor, in a given moment, perform all possible visual tasks. Rather, it must lose some information, and prioritize some tasks over others. The human visual system has developed a number of strategies for dealing with its limited capacity. This paper reviews recent evidence for one strategy: encoding the visual input in terms of a rich set of local image statistics, where the local regions grow — and the representation becomes less precise — with distance from fixation. The explanatory power of this proposed encoding scheme has implications for another proposed strategy for dealing with limited capacity: that of selective attention, which gates visual processing so that the visual system momentarily processes some objects, features, or locations at the expense of others. A lossy peripheral encoding offers an alternative explanation for a number of phenomena used to study selective attention. Based on lessons learned from studying peripheral vision, this paper proposes a different characterization of capacity limits as limits on decision complexity. A general-purpose decision process may deal with such limits by "cutting corners" when the task becomes too complicated.
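To make the proposed encoding concrete, a common simplifying assumption (in the spirit of Bouma's law of crowding, not a value taken from this review) is that the local pooling regions over which image statistics are computed grow linearly with eccentricity. The sketch below illustrates that assumption; the 0.5 scaling factor is a frequently cited approximation from the crowding literature.

```python
# Illustrative sketch: pooling regions whose size grows linearly with
# distance from fixation (eccentricity). The ~0.5 "Bouma factor" is a
# common approximation from the crowding literature, not this paper's value.

def pooling_region_diameter(eccentricity_deg: float, bouma_factor: float = 0.5) -> float:
    """Approximate diameter (degrees) of the local region over which
    image statistics are pooled, at a given eccentricity."""
    return bouma_factor * eccentricity_deg

if __name__ == "__main__":
    for ecc in (2, 5, 10, 20):
        print(f"eccentricity {ecc:>2} deg -> pooling region ~{pooling_region_diameter(ecc):.1f} deg")
```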
Experimental phenomenology probes the meanings and qualities that compose immediate visual experience. In contradistinction, the objective methods of classical psychophysics intentionally ignore meanings and qualities, or even awareness as such. Both have their proper uses. Methods of experimental phenomenology that address "equivalence" in a more intricate sense than "visible–not visible" or "discriminable–not discriminable" require stimuli that go beyond the mere level of magnitude-like parameters and perhaps intrude into the realm of semantics. One investigates the cloud of eidolons, or lookalikes, that mentally surround any image. "Eidolon factories" are based on models of the psychogenesis of visual awareness. The intentional fuzziness of eidolons may derive from a variety of processes. We explore the effects of capricious "local sign". Elsewhere, we formally proposed explicit eidolon factories based on such notions. Here we illustrate some of the effects of capricious local sign.
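As a minimal sketch of how capricious local sign can be simulated (our own illustration under stated assumptions, not the authors' published eidolon factory), each pixel's position, i.e. its local sign, can be perturbed by a smooth, spatially correlated random displacement field. The parameter names `reach` (displacement amplitude) and `grain` (spatial scale of the perturbation) are assumptions for this sketch.

```python
# Minimal sketch (our illustration, not the authors' published eidolon
# factory): "capricious local sign" modeled as a smooth random displacement
# field that perturbs each pixel's position. `reach` and `grain` are
# assumed parameter names for amplitude and spatial scale.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def eidolon_local_sign(image, reach=8.0, grain=4.0, seed=0):
    rng = np.random.default_rng(seed)
    h, w = image.shape
    # Smooth (spatially correlated) random displacement fields for y and x.
    dy = gaussian_filter(rng.standard_normal((h, w)), grain)
    dx = gaussian_filter(rng.standard_normal((h, w)), grain)
    # Normalize so the displacement magnitude is on the order of `reach`.
    dy *= reach / (np.abs(dy).max() + 1e-9)
    dx *= reach / (np.abs(dx).max() + 1e-9)
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    # Resample the image at the perturbed coordinates.
    return map_coordinates(image, [yy + dy, xx + dx], order=1, mode="reflect")
```

Larger `reach` and coarser `grain` produce progressively more "capricious" lookalikes of the original image.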
Recent advances in computational models in vision science have considerably furthered our understanding of human visual perception. At the same time, rapid advances in convolutional deep neural networks (DNNs) have produced computer vision models of object recognition which, for the first time, rival human object recognition. Furthermore, it has been suggested that DNNs may not only be successful models for computer vision, but may also be good computational models of the monkey and human visual systems. The advances in computational models in both vision science and computer vision pose two challenges in two different and independent domains. First, because the latest computational models have much higher predictive accuracy, and competing models may make similar predictions, we require more human data to statistically distinguish between different models; we therefore need methods for acquiring trustworthy human behavioural data quickly and easily. Second, we need challenging experiments to ascertain whether models show similar input-output behaviour only near "ceiling" performance, or whether their performance degrades similarly to human performance: only then do we have strong evidence that models and human observers may be using similar features and processing strategies. In this paper we address both challenges.
When dealing with movies, closing the tremendous gap between low-level features and the richness of semantics in the viewers' cognitive processes requires a variety of approaches and different perspectives. For instance, when attempting to relate movie content to users' affective responses, previous work suggests that a direct mapping of audio-visual properties onto elicited emotions is difficult, due to the high variability of individual reactions. To reduce the gap between the objective level of features and the subjective sphere of emotions, we exploit the intermediate representation of the connotative properties of movies: the set of shooting and editing conventions that help convey meaning to the audience. One of these stylistic features, the shot scale, i.e. the distance of the camera from the subject, effectively regulates theory of mind: increasing spatial proximity to the character triggers a higher occurrence of mental-state references in viewers' story descriptions. Movies are also becoming an important stimulus in neural decoding, an ambitious line of research within contemporary neuroscience aiming at "mindreading". In this field we address the challenge of producing decoding models for the reconstruction of perceptual contents by combining fMRI data and deep features in a hybrid model able to predict specific video object classes.
Visual attention refers to the cognitive mechanism that allows us to select and process only the relevant information arriving at our eyes; eye movements therefore depend strongly on visual attention. Saliency models, which attempt to simulate visual gaze and, consequently, visual attention, have been under continuous development in recent years. Color information has been shown to play an important role in visual attention, and it is used in saliency computations. However, psychophysical evidence explaining the relationship between color and saliency is lacking. We will present the results of an experiment aimed at studying and quantifying the saliency of colors of different hues and lightness specified in CIELab coordinates. In the experiment, 12 observers were asked to report the number of color patches presented at random locations on a masking gray background. Eye movements were recorded using an SMI remote eye-tracking system and were used to validate the reported data. In the presentation, we will compare the reported data and visual gaze data for different colors and discuss the implications for our understanding of color saliency and color processing.
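For concreteness, stimuli of this kind can be specified as CIELab patches of varying hue and lightness at fixed chroma and converted to sRGB for display. The sketch below is hypothetical (the lightness levels, hue sampling, and chroma value are our assumptions, not the experiment's actual parameters).

```python
# Hypothetical stimulus-generation sketch (not the experiment's actual
# parameters): color patches of fixed chroma, varying hue and lightness,
# specified in CIELab and converted to sRGB for display.
import numpy as np
from skimage.color import lab2rgb

def patch_colors(lightness_levels=(40, 60, 80), n_hues=8, chroma=40.0):
    colors = []
    for L in lightness_levels:
        for h in np.linspace(0, 2 * np.pi, n_hues, endpoint=False):
            a, b = chroma * np.cos(h), chroma * np.sin(h)  # hue angle -> a*, b*
            lab = np.array([[[float(L), a, b]]])
            colors.append(lab2rgb(lab)[0, 0])  # sRGB values in [0, 1]
    return np.array(colors)
```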
Providing natural 3D visualization is a major challenge for 3D display technologies. Although 3D displays with light-ray reconstruction have been demonstrated, the 3D scenes they can display are limited because their depth-reconstruction range is restricted. Here, we attempt to expand this range virtually by introducing "depth-compressed expressions," in which the depth of a 3D scene is compressed or modified in the axial direction so that the appearance of the depth-compressed scene remains natural to viewers. With a simulated system of an autostereoscopic 3D display with light-ray reconstruction, we investigated how large the depth range needed to be to show depth-compressed scenes without inducing unnaturalness in viewers. Using a linear depth-compression method, the simplest form of depth compression, we found that viewers did not perceive unnaturalness in depth-compressed scenes that were expressed within at most half the depth range of the originals. These results give us a design goal in developing 3D displays for high-quality 3D visualization.
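A linear depth compression of this kind can be written as a simple rescaling of signed depth about the display plane. The sketch below is a minimal illustration under our own assumptions (the paper's exact formulation, reference plane, and units are not given in the abstract).

```python
# Minimal sketch (assumptions, not the paper's exact formulation): linear
# depth compression that rescales scene depth about the display plane by a
# factor c; c = 0.5 corresponds to "half the depth range of the original".

def compress_depth(z_scene: float, z_display: float = 0.0, c: float = 0.5) -> float:
    """Map a scene depth z (signed axial distance from the display plane)
    to a compressed depth, shrinking the axial extent by the factor c."""
    return z_display + c * (z_scene - z_display)

# Example: a scene spanning -200..+200 mm about the display plane is
# expressed within -100..+100 mm, i.e. half the original depth range.
for z in (-200.0, 0.0, 200.0):
    print(z, "->", compress_depth(z))
```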
Blind image quality assessment (BIQA) of distorted stereoscopic pairs without reference to the undistorted source is a challenging problem, especially when the distortions in the left and right views are asymmetric. Existing studies suggest that simply averaging the quality of the left and right views predicts the quality of symmetrically distorted stereoscopic images well, but produces substantial prediction bias when applied to asymmetrically distorted stereoscopic images. In this study, we propose a binocular-rivalry-inspired multi-scale model that predicts the quality of stereoscopic images from that of the single-view images without reference to the original left- and right-view images. We apply this blind 2D-to-3D quality prediction model on top of ten state-of-the-art base 2D-BIQA algorithms for 3D-BIQA. Experimental results show that the proposed 3D-BIQA model, without explicitly identifying image distortion types, successfully eliminates the prediction bias, leading to significantly improved quality prediction performance. Among all the base 2D-BIQA algorithms, BRISQUE and M3 achieve excellent tradeoffs between accuracy and complexity.
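To illustrate the general idea of a rivalry-inspired combination (this is our simplified sketch, not the authors' exact multi-scale model), each view's 2D quality score can be weighted by that view's signal energy, since rivalry dominance tends to favor the stronger stimulus. Here `q_left`/`q_right` would come from any base 2D-BIQA algorithm (e.g., BRISQUE), and local variance serves as a stand-in energy measure.

```python
# Illustrative sketch, not the authors' exact model: a binocular-rivalry-
# inspired combination that weights each view's 2D quality score by that
# view's signal energy (local variance as a simple energy proxy).
import numpy as np
from scipy.ndimage import uniform_filter

def local_energy(img, size=7):
    img = img.astype(float)
    mean = uniform_filter(img, size)
    mean_sq = uniform_filter(img ** 2, size)
    return np.maximum(mean_sq - mean ** 2, 0.0)  # local variance map

def rivalry_weighted_quality(q_left, q_right, img_left, img_right):
    e_l = local_energy(img_left).sum()
    e_r = local_energy(img_right).sum()
    w_l = e_l / (e_l + e_r)  # dominance weight for the left view
    return w_l * q_left + (1.0 - w_l) * q_right
```

In the actual model the weighting would be computed and combined across multiple scales, as the multi-scale formulation in the abstract suggests.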
We evaluate improvements to image utility assessment algorithms from the inclusion of saliency information, as well as the saliency prediction performance of three saliency models based on successful utility estimators. Fourteen saliency models were incorporated into several utility estimation algorithms, resulting in significantly improved performance in some cases, with RMSE reductions of between 3 and 25%. Algorithms designed for utility estimation benefit less from the addition of saliency information than those originally designed for quality estimation, suggesting that estimators designed to measure utility already capture some saliency information, and that saliency is important for utility estimation. To test this hypothesis, three saliency models were created from the NICE and MS-DGU utility estimators by convolving logical maps of image contours with a Gaussian function. The performance of these utility-based models reveals that high-performing utility estimation algorithms can also predict saliency to an extent, reaching approximately 77% of the prediction performance of state-of-the-art saliency models when evaluated on two common saliency datasets.
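The contour-to-saliency construction is stated explicitly above; the sketch below reproduces it under assumptions: a Canny detector stands in for the contour maps that the paper derives from the NICE and MS-DGU utility estimators, and the Gaussian width is illustrative.

```python
# Minimal sketch of the stated construction: a saliency map built by
# convolving a logical (binary) map of image contours with a Gaussian.
# The Canny edge detector and sigma values are our assumptions; the paper
# derives its contour maps from the NICE and MS-DGU utility estimators.
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.feature import canny

def contour_based_saliency(gray_image, sigma_edges=2.0, sigma_blur=16.0):
    contours = canny(gray_image, sigma=sigma_edges)  # logical contour map
    saliency = gaussian_filter(contours.astype(float), sigma_blur)
    return saliency / (saliency.max() + 1e-9)  # normalize to [0, 1]
```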
Psychovisual rate-distortion optimization (Psy-RD) has been used in industrial video coding practice as a tool to improve perceptual video quality. It has earned significant popularity through the widespread adoption of the open-source x264 video encoder, where the Psy-RD option is enabled by default. Nevertheless, little work has been dedicated to validating the impact of Psy-RD optimization on perceptual quality, so as to provide meaningful guidance on the practical usage and future development of the idea. In this work, we build a database that contains Psy-RD encoded video sequences at different strengths and bitrates. A subjective user study is then conducted to evaluate and compare the quality of the Psy-RD encoded videos. We observe considerable agreement between subjects' opinions on the test video sequences. Unfortunately, the impact of Psy-RD optimization on video quality does not appear to be encouraging: somewhat surprisingly, the perceptual quality gain of Psy-RD ON versus Psy-RD OFF is negative on average. Our results suggest that Psy-RD optimization should be used with caution. Further investigations show that most state-of-the-art full-reference objective quality models correlate well with the subjective experiment results overall, but in the paired comparison between Psy-RD ON and OFF cases, their false alarm rates are moderately high.