Image captioning generates text that describes scenes from input images. It has largely been developed for high-quality images taken in clear weather. However, in bad weather conditions, such as heavy rain, snow, and dense fog, poor visibility resulting from rain streaks, rain accumulation, and snowflakes causes serious degradation of image quality. This hinders the extraction of useful visual features and leads to deteriorated image captioning performance. To address this practical issue, this study introduces a new encoder for captioning heavy rain images. The central idea is to transform features extracted from heavy rain input images into semantic visual features associated with words and sentence context. To achieve this, a target encoder is initially trained in an encoder-decoder framework to associate visual features with semantic words. Subsequently, the objects in a heavy rain image are rendered visible by an initial reconstruction subnetwork (IRS) based on a heavy rain model. The IRS is then combined with a semantic visual feature matching subnetwork (SVFMS) that matches the output features of the IRS with the semantic visual features of the pretrained target encoder. The proposed encoder is based on the joint learning of the IRS and SVFMS; it is trained in an end-to-end manner and then connected to the pretrained decoder for image captioning. It is experimentally demonstrated that the proposed encoder can generate semantic visual features associated with words even from heavy rain images, thereby increasing the accuracy of the generated captions.
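To make the two-stage design concrete, the following is a minimal conceptual sketch (not the authors' code) of the idea described above: an IRS restores a heavy rain image, an SVFMS maps the restored image to features, and a feature-matching loss pulls those features toward the ones produced by a frozen, pretrained target encoder on the paired clear-weather image. All module architectures, shapes, and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IRS(nn.Module):
    """Toy stand-in for the initial reconstruction subnetwork."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class SVFMS(nn.Module):
    """Toy stand-in for the semantic visual feature matching subnetwork."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
    def forward(self, x):
        return self.backbone(x)

irs, svfms = IRS(), SVFMS()
target_encoder = SVFMS()              # placeholder for the pretrained target encoder
target_encoder.requires_grad_(False)  # kept frozen during joint training

optimizer = torch.optim.Adam(
    list(irs.parameters()) + list(svfms.parameters()), lr=1e-4)

rain_img = torch.rand(4, 3, 224, 224)   # heavy rain input (dummy batch)
clean_img = torch.rand(4, 3, 224, 224)  # paired clear-weather image (dummy batch)

restored = irs(rain_img)                     # stage 1: restore visibility
pred_feat = svfms(restored)                  # stage 2: extract semantic visual features
with torch.no_grad():
    target_feat = target_encoder(clean_img)  # semantic visual features to match

# Joint end-to-end objective: reconstruction term + feature-matching term
loss = nn.functional.mse_loss(restored, clean_img) \
     + nn.functional.mse_loss(pred_feat, target_feat)
loss.backward()
optimizer.step()
```

In this sketch the trained `svfms` output would then be fed to the pretrained captioning decoder in place of clean-image features; the specific loss weighting and backbone choices are assumptions, not details from the abstract.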
Compared to low-level saliency, higher-level information better predicts human eye movements in static images. In the current study, we tested how both types of information predict eye movements while observers view videos. We generated multiple eye movement prediction maps based on low-level saliency features, as well as on higher-level information that requires cognition and therefore cannot be derived from bottom-up processes alone. We investigated eye movement patterns to both static and dynamic features that contained either low- or higher-level information. We found that higher-level object-based and multi-frame motion information predict human eye movement patterns better than static saliency and two-frame motion information, and that higher-level static and dynamic features provide equally good predictions. The results suggest that object-based processes and temporal integration of multiple video frames are essential in guiding human eye movements during video viewing.
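As a brief illustration of how such prediction maps can be compared against recorded gaze, the following is a minimal sketch (not the authors' analysis code) of a common AUC-style evaluation: each map is scored at fixated pixels versus randomly sampled control pixels, and the metric is the probability of ranking fixations above controls. The array shapes, map names, and fixation format below are illustrative assumptions.

```python
import numpy as np

def auc_score(pred_map, fixations, n_controls=1000, rng=None):
    """pred_map: HxW prediction map; fixations: list of (row, col) fixated pixels."""
    rng = np.random.default_rng(rng)
    h, w = pred_map.shape
    fix_vals = np.array([pred_map[r, c] for r, c in fixations])
    ctrl_vals = pred_map[rng.integers(0, h, n_controls), rng.integers(0, w, n_controls)]
    # Fraction of fixation/control pairs where the fixated pixel scores higher
    # (ties count as 0.5), i.e. the probability of ranking fixations above controls.
    greater = (fix_vals[:, None] > ctrl_vals[None, :]).mean()
    ties = (fix_vals[:, None] == ctrl_vals[None, :]).mean()
    return greater + 0.5 * ties

# Dummy data: a low-level saliency map and an object-based map for one video frame,
# plus a handful of recorded fixations (all made up for illustration).
saliency_map = np.random.rand(360, 640)
object_map = np.random.rand(360, 640)
fixations = [(120, 300), (200, 410), (95, 150)]

print("saliency AUC:", auc_score(saliency_map, fixations, rng=0))
print("object-based AUC:", auc_score(object_map, fixations, rng=0))
```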