In this paper, we propose a multimodal unsupervised video learning algorithm designed to incorporate information from any number of modalities present in the data. We cooperatively train one network per modality: at each stage of training, one of these networks is selected and trained using the outputs of the other networks. To verify our algorithm, we train a model on RGB, optical flow, and audio. We then evaluate the effectiveness of the unsupervised representation by performing action classification and nearest-neighbor retrieval on a labeled dataset. We compare this three-modality model to contrastive learning models using one or two modalities, and find that using all three modalities in tandem improves UCF101 classification accuracy by 1.5% and R@1, R@5, and R@10 retrieval recall by 1.4%, 3.5%, and 2.4%, respectively, over using only RGB and optical flow, demonstrating the merit of utilizing as many modalities as possible in a cooperative learning model.
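The abstract does not spell out the training loop, but the scheme it describes (one network per modality, with a selected network trained against the outputs of the others) can be sketched roughly as follows. This is a minimal PyTorch-style sketch under stated assumptions: the encoder names (`rgb_net`, `flow_net`, `audio_net`, held in a `nets` dict), the use of an InfoNCE-style contrastive loss, and the random choice of which modality to train at each stage are illustrative assumptions, not the authors' exact method.

```python
# Minimal sketch of one cooperative training stage, assuming PyTorch and three
# hypothetical per-modality encoders that map a batch of clips to embeddings.
# The selected ("student") network is trained to agree, via an InfoNCE-style
# contrastive loss, with the detached outputs of the other networks.
import random
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.07):
    """Contrastive loss: each query's positive key is the key at the same index."""
    logits = queries @ keys.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, targets)

def cooperative_step(nets, batch, optimizers):
    """One stage: pick one modality network and train it on the others' outputs."""
    names = list(nets.keys())                               # e.g. ['rgb', 'flow', 'audio']
    student = random.choice(names)                          # modality selected for training
    teachers = [n for n in names if n != student]

    with torch.no_grad():                                   # other networks give fixed targets
        target = torch.stack(
            [F.normalize(nets[n](batch[n]), dim=-1) for n in teachers]
        ).mean(dim=0)
        target = F.normalize(target, dim=-1)

    pred = F.normalize(nets[student](batch[student]), dim=-1)
    loss = info_nce(pred, target)

    optimizers[student].zero_grad()
    loss.backward()
    optimizers[student].step()
    return student, loss.item()
```

A full training run would repeat `cooperative_step` over the unlabeled video dataset, cycling which modality plays the student role.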
In the early phases of the pandemic lockdown, our team was eager to share our collection in new ways. Using an existing 3D asset and advances in AR technology, we were able to augment a 3D model of a collection object with the voice of a curator to add context and value. This experience leveraged the unique capabilities of the USDZ extension of the open Pixar USD format. This paper documents the workflow behind creating the AR experience as well as other applications of the USD/USDZ format for cultural heritage. It also provides valuable information about developments, limitations, and misconceptions between WebXR glTF and USDZ.
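The abstract summarizes the workflow rather than its implementation. As a rough illustration of how such an audio-augmented asset might be assembled with the open-source USD Python bindings (pxr), the sketch below references a hypothetical model layer, attaches a narration track via the UsdMedia SpatialAudio schema, and packages the result as USDZ. The file names and the choice of SpatialAudio are assumptions for illustration, not the specific workflow the paper documents.

```python
# A minimal sketch, assuming a recent build of the Pixar USD Python bindings
# (pxr) with the UsdMedia schema available. File names ("statue.usdc",
# "narration.mp3") are hypothetical.
from pxr import Usd, UsdGeom, UsdMedia, UsdUtils, Sdf

stage = Usd.Stage.CreateNew("exhibit.usdc")
root = UsdGeom.Xform.Define(stage, "/Exhibit")
stage.SetDefaultPrim(root.GetPrim())

# Reference the existing 3D asset rather than re-authoring its geometry.
model = stage.DefinePrim("/Exhibit/Model")
model.GetReferences().AddReference("./statue.usdc")

# Attach the curator's narration as an audio prim associated with the model.
audio = UsdMedia.SpatialAudio.Define(stage, "/Exhibit/Narration")
audio.CreateFilePathAttr().Set("./narration.mp3")
audio.CreateAuralModeAttr().Set(UsdMedia.Tokens.nonSpatial)

stage.GetRootLayer().Save()

# Bundle the layer and its referenced assets into a single .usdz package.
UsdUtils.CreateNewUsdzPackage(Sdf.AssetPath("exhibit.usdc"), "exhibit.usdz")
```

The resulting .usdz package can then be delivered to USDZ-capable AR viewers such as AR Quick Look.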