<!DOCTYPE article PUBLIC '-//NLM//DTD Journal Publishing DTD v2.1 20050630//EN' 'http://uploads.ingentaconnect.com/docs/dtd/ingenta-journalpublishing.dtd'>
<article article-type="research-article">
  <front>
    <journal-meta>
      <journal-id journal-id-type="aggregator">72010604</journal-id>
      <journal-title>Electronic Imaging</journal-title>
      <issn pub-type="ppub">2470-1173</issn>
      <publisher>
        <publisher-name>Society for Imaging Science and Technology</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.2352/ISSN.2470-1173.2016.13.IQSP-217</article-id>
      <article-id pub-id-type="sici">2470-1173(20160214)2016:13L.1;1-</article-id>
      <article-id pub-id-type="publisher-id">s21.phd</article-id>
      <article-id pub-id-type="other">/ist/ei/2016/00002016/00000013/art00021</article-id>
      <article-categories>
        <subj-group>
          <subject>Perception and Quality</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>An Audiovisual Saliency Model for Conferencing and Conversation Videos</article-title>
      </title-group>
      <contrib-group>
        <contrib>
          <name>
            <surname>Sidaty</surname>
            <given-names>Naty Ould</given-names>
          </name>
        </contrib>
        <contrib>
          <name>
            <surname>Larabi</surname>
            <given-names>Mohamed-Chaker</given-names>
          </name>
        </contrib>
        <contrib>
          <name>
            <surname>Saadane</surname>
            <given-names>Abdelhakim</given-names>
          </name>
        </contrib>
      </contrib-group>
      <pub-date>
        <day>14</day>
        <month>02</month>
        <year>2016</year>
      </pub-date>
      <volume>2016</volume>
      <issue>13</issue>
      <fpage>1</fpage>
      <lpage>6</lpage>
      <permissions>
        <copyright-year>2016</copyright-year>
      </permissions>
      <abstract>
        <p>Visual attention modeling is a very active research area. During the last decade, several image and video attention models have been proposed. Unfortunately, the majority of classical video attention models do not take into account the multimodal aspect of video (visual and auditory cues), even though several studies have shown that human gaze is affected by the presence of a soundtrack. In this paper, we propose an audiovisual saliency model that predicts human gaze maps when exploring conferencing or conversation videos. The model is based on the fusion of spatial, temporal, and auditory attention maps. Thanks to a real-time audiovisual speaker localization method, the proposed auditory maps enhance the saliency of the speakers' regions compared to the other faces in the video. Classical visual attention measures are used to compare the predicted saliency maps with the eye-tracking ground truth. Results of the proposed approach, using several fusion methods, show good performance regardless of the spatial model used.</p>
      </abstract>
    </article-meta>
  </front>
</article>
