Visual attention modeling is a very active research area, and during the last decade several image and video attention models have been proposed. Unfortunately, most classical video attention models do not take into account the multimodal nature of video (visual and auditory cues), even though several studies have shown that human gaze is affected by the presence of the soundtrack. In this paper we propose an audiovisual saliency model that predicts human gaze maps when exploring conferencing or conversation videos. The model is based on the fusion of spatial, temporal, and auditory attention maps. Thanks to a real-time audiovisual speaker localization method, the auditory maps enhance the saliency of speaker regions relative to the other faces in the video. Classical visual attention measures are used to compare the predicted saliency maps with eye-tracking ground truth. Results of the proposed approach, obtained with several fusion methods, show good performance regardless of the spatial model used.
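The abstract describes fusing spatial, temporal, and auditory attention maps, with the auditory map modulated so the active speaker's face is boosted relative to other faces. The sketch below is only an illustration of that idea, not the paper's actual method: it assumes a simple weighted linear fusion (the paper evaluates several fusion schemes), and the function names, weights, and `boost` factor are hypothetical.

```python
import numpy as np

def fuse_saliency_maps(spatial, temporal, auditory,
                       w_spatial=0.4, w_temporal=0.3, w_auditory=0.3):
    """Weighted linear fusion of per-frame saliency maps.

    All inputs are 2-D arrays of the same shape; each map is
    normalized to [0, 1] before mixing so no single cue dominates
    purely because of its dynamic range. (Illustrative weights only.)
    """
    def normalize(m):
        m = m.astype(np.float64)
        rng = m.max() - m.min()
        return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

    fused = (w_spatial * normalize(spatial)
             + w_temporal * normalize(temporal)
             + w_auditory * normalize(auditory))
    return normalize(fused)

def modulate_by_speaker(auditory_map, face_boxes, speaker_index, boost=2.0):
    """Boost the saliency inside the detected speaker's face region
    relative to the other faces (hypothetical helper; face_boxes are
    (x, y, w, h) rectangles in pixel coordinates)."""
    out = auditory_map.copy()
    for i, (x, y, w, h) in enumerate(face_boxes):
        gain = boost if i == speaker_index else 1.0
        out[y:y + h, x:x + w] *= gain
    return out
```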
Naty Ould Sidaty, Mohamed-Chaker Larabi, Abdelhakim Saadane, "An Audiovisual Saliency Model For Conferencing and Conversation Videos" in Proc. IS&T Int’l. Symp. on Electronic Imaging: Image Quality and System Performance XIII, 2016, https://doi.org/10.2352/ISSN.2470-1173.2016.13.IQSP-217