Proceedings Paper
Volume: 37 | Article ID: IMAGE-263
Image
ZAR: Zero-shot Action Recognition with Dynamic Prompt Tuning
DOI: 10.2352/EI.2025.37.8.IMAGE-263  Published Online: February 2025
Abstract

Pre-trained vision-language models, exemplified by CLIP, have exhibited promising zero-shot capabilities across various downstream tasks. Trained on image-text pairs, CLIP is naturally extendable to video-based action recognition, due to the similarity between processing images and video frames. To leverage this inherent synergy, numerous efforts have been directed towards adapting CLIP for action recognition tasks in videos. However, the specific methodologies for this adaptation remain an open question. Common approaches include prompt tuning and fine-tuning with or without extra model components on video-based action recognition tasks. Nonetheless, such adaptations may compromise the generalizability of the original CLIP framework and also necessitate the acquisition of new training data, thereby undermining its inherent zero-shot capabilities. In this study, we propose zero-shot action recognition (ZAR) by adapting the CLIP pre-trained model without the need for additional training datasets. Our approach leverages the entropy minimization technique, utilizing the current video test sample and augmenting it with varying frame rates. We encourage the model to make consistent decisions, and use this consistency to dynamically update a prompt learner during inference. Experimental results demonstrate that our ZAR method achieves state-of-the-art zero-shot performance on the Kinetics-600, HMDB51, and UCF101 datasets.
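The test-time procedure described in the abstract (augment one test video at several frame rates, then minimize prediction entropy to update the prompt learner) can be illustrated with a small toy sketch. This is not the authors' implementation: the function names, the per-class bias standing in for the prompt learner, the finite-difference gradient, and the logit values are all assumptions made for illustration. A real system would obtain the logits from CLIP and backpropagate through the text encoder into learnable prompt tokens.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_frame_indices(num_frames, clip_len, stride):
    """Frame-rate augmentation: pick clip_len indices spaced by stride,
    clamped to the last available frame."""
    return [min(i * stride, num_frames - 1) for i in range(clip_len)]


def marginal_entropy(view_logits):
    """Entropy of the class distribution averaged over all augmented views.
    It is low only when the views agree on a confident prediction."""
    probs = [softmax(logits) for logits in view_logits]
    n_views = len(probs)
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / n_views for c in range(n_classes)]
    return -sum(p * math.log(p) for p in avg if p > 0)


def apply_bias(view_logits, bias):
    """Add the (hypothetical) learnable per-class bias to every view."""
    return [[x + b for x, b in zip(logits, bias)] for logits in view_logits]


def tune_bias(view_logits, steps=50, lr=0.1, eps=1e-4):
    """Toy stand-in for the prompt learner: tune a per-class bias by
    gradient descent on the marginal entropy, using central differences
    in place of backpropagation."""
    bias = [0.0] * len(view_logits[0])
    for _ in range(steps):
        grad = []
        for c in range(len(bias)):
            up, down = bias[:], bias[:]
            up[c] += eps
            down[c] -= eps
            loss_up = marginal_entropy(apply_bias(view_logits, up))
            loss_down = marginal_entropy(apply_bias(view_logits, down))
            grad.append((loss_up - loss_down) / (2 * eps))
        bias = [b - lr * g for b, g in zip(bias, grad)]
    return bias


# Three views of one 32-frame video, sampled at different strides.
views_indices = [sample_frame_indices(32, 8, s) for s in (1, 2, 4)]

# Made-up class logits for those three views (a real model would produce these).
view_logits = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3], [1.8, 0.7, 0.2]]

before = marginal_entropy(view_logits)
bias = tune_bias(view_logits)
after = marginal_entropy(apply_bias(view_logits, bias))
print(f"marginal entropy before: {before:.4f}, after: {after:.4f}")
```

The objective here is the entropy of the view-averaged distribution, which is one common way to reward consistent predictions across augmentations; the abstract does not state which exact entropy formulation the paper uses.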

  Cite this article 

Qiyue Liang, Cheng Lu, Chun Tao, Jan P. Allebach, "ZAR: Zero-shot Action Recognition with Dynamic Prompt Tuning," in Electronic Imaging, 2025, pp. 263-1 - 263-10, https://doi.org/10.2352/EI.2025.37.8.IMAGE-263

  Copyright statement 
Copyright © 2025, Society for Imaging Science and Technology
Electronic Imaging
2470-1173
Society for Imaging Science and Technology
IS&T 7003 Kilworth Lane, Springfield, VA 22151 USA