Volume: 35 | Article ID: COIMG-173
Multimodal contrastive learning for unsupervised video representation learning
DOI: 10.2352/EI.2023.35.14.COIMG-173 | Published Online: January 2023
Abstract

In this paper, we propose a multimodal unsupervised video learning algorithm designed to incorporate information from any number of modalities present in the data. We cooperatively train a network corresponding to each modality: at each stage of training, one of these networks is selected to be trained using the output of the other networks. To verify our algorithm, we train a model using RGB, optical flow, and audio. We then evaluate the effectiveness of our unsupervised learning model by performing action classification and nearest neighbor retrieval on a supervised dataset. We compare this triple modality model to contrastive learning models using one or two modalities, and find that using all three modalities in tandem provides a 1.5% improvement in UCF101 classification accuracy, a 1.4% improvement in R@1 retrieval recall, a 3.5% improvement in R@5 retrieval recall, and a 2.4% improvement in R@10 retrieval recall as compared to using only RGB and optical flow, demonstrating the merit of utilizing as many modalities as possible in a cooperative learning model.
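The abstract describes the cooperative scheme only at a high level: one network per modality, and at each training stage one network is selected and trained against the outputs of the others. The following is a minimal, dependency-light sketch of that idea, not the authors' implementation. All specifics are assumptions made for illustration: toy linear encoders in place of real RGB/flow/audio networks, an InfoNCE-style contrastive loss, round-robin selection of the trainee network, the other networks' embeddings averaged into a fixed "teacher", and finite-difference gradients standing in for backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy linear 'network' mapping modality features to a unit-norm embedding."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def info_nce(anchors, positives, tau=0.2):
    """Contrastive loss: each anchor should match its own positive
    against all other positives in the batch (InfoNCE)."""
    logits = anchors @ positives.T / tau            # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # positives lie on the diagonal

# Hypothetical sizes: batch of 8 clips, small per-modality feature dims, 4-d embedding.
B, D = 8, 4
dims = {"rgb": 6, "flow": 6, "audio": 5}
feats = {m: rng.standard_normal((B, d)) for m, d in dims.items()}
W = {m: rng.standard_normal((d, D)) * 0.5 for m, d in dims.items()}

def cooperative_step(target, lr=0.1, eps=1e-4):
    """Train only the selected modality's network against the (frozen) others."""
    others = [m for m in dims if m != target]
    teacher = sum(encode(feats[m], W[m]) for m in others)   # combine other modalities
    teacher /= np.linalg.norm(teacher, axis=1, keepdims=True)
    loss = info_nce(encode(feats[target], W[target]), teacher)
    # Finite-difference gradient of the loss w.r.t. the trainee's weights
    # (a stand-in for backprop, keeping the sketch dependency-free).
    grad = np.zeros_like(W[target])
    for i in range(grad.shape[0]):
        for j in range(grad.shape[1]):
            Wp = W[target].copy()
            Wp[i, j] += eps
            grad[i, j] = (info_nce(encode(feats[target], Wp), teacher) - loss) / eps
    W[target] -= lr * grad
    return loss

# Round-robin over modalities, as one plausible reading of "one network is
# selected to be trained at each stage".
losses = [cooperative_step(list(dims)[s % 3]) for s in range(9)]
```

The round-robin schedule means every modality alternates between the "student" role (being updated) and the "teacher" role (supplying targets), which is what lets information from all three modalities flow into each network.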

  Cite this article 

Anup Hiremath, Avideh Zakhor, "Multimodal contrastive learning for unsupervised video representation learning," in Electronic Imaging, 2023, pp. 173-1 - 173-6, https://doi.org/10.2352/EI.2023.35.14.COIMG-173

  Copyright statement 
Copyright © 2023, Society for Imaging Science and Technology
Electronic Imaging
ISSN: 2470-1173
Society for Imaging Science and Technology
IS&T 7003 Kilworth Lane, Springfield, VA 22151 USA