The development of audio-visual quality models faces a number of challenges, including the integration of the audio and video sensory channels and the modeling of their interaction characteristics. Commonly, objective quality metrics estimate the quality of a single component (audio or video) of the content. Machine learning techniques, such as autoencoders, offer a promising alternative for developing objective assessment models. This paper studies the performance of a group of autoencoder-based objective quality metrics on a diverse set of audio-visual content. For this test, we use a large dataset of audio-visual content (the UnB-AV database), which contains degradations in both the audio and video components. The database has accompanying subjective scores collected in three separate subjective experiments. We compare our autoencoder-based methods, which take both the audio and video components into account (multi-modal), against several objective single-modal audio and video quality metrics. The main goal of this work is to verify the gain or loss in performance of these single-modal metrics when they are tested on audio-visual sequences.
This work presents the results of a psycho-physical experiment in which a group of 40 human participants rated the overall quality of a set of 40 high-definition audio-visual sequences. These sequences were impaired with the types of audio and video distortions commonly encountered in an Internet-based transmission scenario. More specifically, Packet Loss and Frame Freezing distortions were added to the video component, while Background Noise, Chop, Clipping, and Echo distortions were added to the audio component. Our goal was to study how audio and visual degradations interact with each other and with the content to produce the overall audio-visual quality. An immersive experimental methodology was used to obtain more accurate observer scores. Preliminary results show that the audio and video degradations interact with each other to produce the overall audio-visual quality. Among the audio degradations, Clipping obtained slightly lower quality scores. Similarly, among the video degradations, Frame Freezing distortions were rated higher. Also, audio degradations had a stronger impact on the audio-visual quality when combined with Packet Loss.