Volume: 35 | Article ID: COIMG-173
Multimodal contrastive learning for unsupervised video representation learning
DOI: 10.2352/EI.2023.35.14.COIMG-173 | Published Online: January 2023
Abstract

In this paper, we propose a multimodal unsupervised video learning algorithm designed to incorporate information from any number of modalities present in the data. We cooperatively train one network per modality: at each stage of training, one of these networks is selected and trained using the output of the other networks. To verify our algorithm, we train a model using RGB, optical flow, and audio. We then evaluate the effectiveness of our unsupervised learning model by performing action classification and nearest neighbor retrieval on a supervised dataset. We compare this three-modality model to contrastive learning models using one or two modalities. Compared to using only RGB and optical flow, using all three modalities in tandem provides a 1.5% improvement in UCF101 classification accuracy, a 1.4% improvement in R@1 retrieval recall, a 3.5% improvement in R@5 retrieval recall, and a 2.4% improvement in R@10 retrieval recall, demonstrating the merit of utilizing as many modalities as possible in a cooperative learning model.
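
The R@1, R@5, and R@10 figures above are retrieval recall values. As a minimal sketch, assuming the common convention that a query counts as a hit when at least one of its k nearest gallery clips shares its class label, and assuming cosine similarity as the distance measure (the abstract specifies neither), R@k could be computed as follows. The function names, toy features, and labels are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k):
    """Fraction of queries with a same-class clip among their k nearest gallery clips."""
    hits = 0
    for feat, label in zip(query_feats, query_labels):
        # Rank gallery indices from most to least similar to this query.
        ranked = sorted(
            range(len(gallery_feats)),
            key=lambda i: cosine_similarity(feat, gallery_feats[i]),
            reverse=True,
        )
        # Assumed convention: one matching label in the top k is a hit.
        if any(gallery_labels[i] == label for i in ranked[:k]):
            hits += 1
    return hits / len(query_feats)


# Toy 2-D embeddings, made up purely for illustration.
gallery_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
gallery_labels = ["run", "jump", "swim", "jump"]
query_feats = [[1.0, 0.05], [0.05, 1.0]]
query_labels = ["jump", "swim"]

r_at_1 = recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k=1)
r_at_2 = recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k=2)
print(r_at_1, r_at_2)
```

In this toy setup the first query's nearest neighbor has the wrong label but its second nearest is correct, so recall rises from 0.5 at k=1 to 1.0 at k=2. The same effect is why R@5 and R@10 are reported alongside R@1.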

  Cite this article 

Anup Hiremath, Avideh Zakhor, "Multimodal contrastive learning for unsupervised video representation learning," in Electronic Imaging, 2023, pp. 173-1 - 173-6, https://doi.org/10.2352/EI.2023.35.14.COIMG-173

  Copyright statement 
Copyright © 2023, Society for Imaging Science and Technology
Electronic Imaging
2470-1173
Society for Imaging Science and Technology
IS&T 7003 Kilworth Lane, Springfield, VA 22151 USA