<!DOCTYPE article PUBLIC '-//NLM//DTD Journal Publishing DTD v2.1 20050630//EN' 'http://uploads.ingentaconnect.com/docs/dtd/ingenta-journalpublishing.dtd'>
<article article-type="research-article">
  <front>
    <journal-meta>
      <journal-id journal-id-type="aggregator">72010604</journal-id>
      <journal-title>Electronic Imaging</journal-title>
      <issn pub-type="ppub">2470-1173</issn>
      <publisher>
        <publisher-name>Society for Imaging Science and Technology</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.2352/ISSN.2470-1173.2016.13.IQSP-217</article-id>
      <article-id pub-id-type="sici">2470-1173(20160214)2016:13L.1;1-</article-id>
      <article-id pub-id-type="publisher-id">s21.phd</article-id>
      <article-id pub-id-type="other">/ist/ei/2016/00002016/00000013/art00021</article-id>
      <article-categories>
        <subj-group>
          <subject>Perception and Quality</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>An Audiovisual Saliency Model for Conferencing and Conversation Videos</article-title>
      </title-group>
      <contrib-group>
        <contrib>
          <name>
            <surname>Sidaty</surname>
            <given-names>Naty Ould</given-names>
          </name>
        </contrib>
        <contrib>
          <name>
            <surname>Larabi</surname>
            <given-names>Mohamed-Chaker</given-names>
          </name>
        </contrib>
        <contrib>
          <name>
            <surname>Saadane</surname>
            <given-names>Abdelhakim</given-names>
          </name>
        </contrib>
      </contrib-group>
      <pub-date>
        <day>14</day>
        <month>02</month>
        <year>2016</year>
      </pub-date>
      <volume>2016</volume>
      <issue>13</issue>
      <fpage>1</fpage>
      <lpage>6</lpage>
      <permissions>
        <copyright-year>2016</copyright-year>
      </permissions>
      <abstract>
        <p>Visual attention modeling is a very active research area. During the last decade, several image and video attention models have been proposed. Unfortunately, the majority of classical video attention models do not take into account the multimodal aspect of video (visual and auditory cues), even though several studies have shown that human gaze is affected by the presence of a soundtrack. In this paper, we propose an audiovisual saliency model that predicts human gaze maps when exploring conferencing or conversation videos. The model is based on the fusion of spatial, temporal, and auditory attention maps. Thanks to a real-time audiovisual speaker localization method, the proposed auditory maps enhance the saliency of the speakers' regions compared to the other faces in the video. Classical visual attention measures are used to compare the predicted saliency maps with the eye-tracking ground truth. Results of the proposed approach, using several fusion methods, show good performance regardless of the spatial model used.</p>
      </abstract>
    </article-meta>
  </front>
</article>
