In this paper, we propose a multimodal unsupervised video learning algorithm designed to incorporate information from any number of modalities present in the data. We cooperatively train one network per modality: at each stage of training, one of these networks is selected to be trained using the output of the other networks. To validate our algorithm, we train a model using RGB, optical flow, and audio. We then evaluate the effectiveness of our unsupervised learning model by performing action classification and nearest neighbor retrieval on a supervised dataset. We compare this triple-modality model against contrastive learning models that use one or two modalities, and find that using all three modalities in tandem yields a 1.5% improvement in UCF101 classification accuracy and improvements of 1.4%, 3.5%, and 2.4% in R@1, R@5, and R@10 retrieval recall, respectively, compared to using only RGB and optical flow, demonstrating the merit of utilizing as many modalities as possible in a cooperative learning model.
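The abstract describes the cooperative training scheme only at a high level, so the following is a minimal Python/PyTorch sketch of how one such stage might look: a selected "student" modality network is updated against the averaged embeddings of the other (frozen) networks through a contrastive loss. The InfoNCE objective, the teacher-averaging step, and the round-robin student schedule are illustrative assumptions, not the authors' exact formulation.

\begin{verbatim}
import torch
import torch.nn.functional as F

def info_nce(anchor, target, temperature=0.1):
    # Align each anchor embedding with the matching target in the batch.
    anchor = F.normalize(anchor, dim=1)
    target = F.normalize(target, dim=1)
    logits = anchor @ target.t() / temperature          # (B, B) similarities
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

def cooperative_step(encoders, optimizers, batch, student_idx):
    # One cooperative stage: the selected "student" network is trained
    # against the frozen outputs of the other modality networks.
    with torch.no_grad():
        teacher_emb = torch.stack(
            [enc(batch[m]) for m, enc in enumerate(encoders)
             if m != student_idx]
        ).mean(dim=0)                                    # averaged teacher embedding
    student_emb = encoders[student_idx](batch[student_idx])
    loss = info_nce(student_emb, teacher_emb)
    optimizers[student_idx].zero_grad()
    loss.backward()
    optimizers[student_idx].step()
    return loss.item()

# Cycling student_idx over {0: RGB, 1: optical flow, 2: audio} trains each
# modality network in turn from the output of the other two.
\end{verbatim}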
Self-supervised learning has been an active area of research in recent years. Contrastive learning is a self-supervised learning method that has achieved significant performance improvements on image classification tasks. However, no prior work has applied it to fisheye images for autonomous driving. In this paper, we propose FisheyePixPro, an adaptation of the pixel-level contrastive learning method PixPro \cite{Xie2021PropagateYE} to fisheye images. This is the first attempt to pretrain a contrastive-learning-based model directly on fisheye images in a self-supervised manner. We evaluate the learned representations on the WoodScape dataset using a segmentation task. Our FisheyePixPro model achieves a 65.78 mIoU score, a significant improvement over the PixPro model. This indicates that a model pre-trained directly on fisheye images performs better on downstream fisheye tasks.
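To make the pixel-level objective concrete, the sketch below shows a simplified PixPro-style consistency loss in Python/PyTorch: two augmented crops of the same fisheye image are encoded into dense feature maps, features whose locations in the source image fall within a distance threshold are treated as positive pairs, and their cosine similarity is maximized. The matching rule, the threshold, and the loss form are simplified assumptions rather than the exact FisheyePixPro recipe.

\begin{verbatim}
import torch
import torch.nn.functional as F

def pixel_consistency_loss(feat_a, feat_b, coords_a, coords_b, radius=0.7):
    # feat_a, feat_b : (B, C, H, W) feature maps from two augmented views.
    # coords_a/b     : (B, 2, H, W) each feature's (x, y) location in the
    #                  original image, used to decide which pixels match.
    B, C, H, W = feat_a.shape
    fa = F.normalize(feat_a.flatten(2), dim=1)           # (B, C, HW)
    fb = F.normalize(feat_b.flatten(2), dim=1)
    sim = torch.einsum('bci,bcj->bij', fa, fb)           # (B, HW, HW) cosine sims

    ca = coords_a.flatten(2).transpose(1, 2)             # (B, HW, 2)
    cb = coords_b.flatten(2).transpose(1, 2)
    dist = torch.cdist(ca, cb)                           # (B, HW, HW) distances
    pos_mask = (dist < radius).float()                   # positives: nearby pixels

    # Maximize similarity of spatially matched pixels across the two views.
    return -(sim * pos_mask).sum() / pos_mask.sum().clamp(min=1)
\end{verbatim}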