Abstract: Highlights
• Improving CLIP's action-related temporal and semantic representations via parameter-efficient fine-tuning.
• Global Temporal Adaptation captures global motion cues efficiently through the class token.
• Local Multimodal Adaptation fuses visual and FSAR-specific text tokens to model local dynamics.
• A text-guided module enriches the temporal and semantic representations of video prototypes.
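The highlights describe two lightweight adapters on top of a frozen CLIP backbone: a temporal one acting on per-frame class tokens and a multimodal one fusing local visual tokens with text tokens. The sketch below is a minimal, hypothetical illustration of that general idea, not the authors' implementation; all module names, bottleneck dimensions, and design details are assumptions, and it presumes per-frame CLIP features have already been extracted.

```python
# Hypothetical sketch of the two adapter ideas from the highlights (illustrative only).
import torch
import torch.nn as nn


class GlobalTemporalAdapter(nn.Module):
    """Lightweight temporal self-attention over per-frame class tokens (assumed design)."""

    def __init__(self, dim: int = 512, bottleneck: int = 64, heads: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)      # parameter-efficient bottleneck
        self.temporal_attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, cls_tokens: torch.Tensor) -> torch.Tensor:
        # cls_tokens: (batch, num_frames, dim) class tokens from a frozen CLIP image encoder
        x = self.down(cls_tokens)
        x, _ = self.temporal_attn(x, x, x)          # exchange global motion cues across frames
        return cls_tokens + self.up(x)              # residual keeps the frozen CLIP features intact


class LocalMultimodalAdapter(nn.Module):
    """Cross-attention from local patch tokens to task-specific text tokens (assumed design)."""

    def __init__(self, dim: int = 512, bottleneck: int = 64, heads: int = 4):
        super().__init__()
        self.down_v = nn.Linear(dim, bottleneck)
        self.down_t = nn.Linear(dim, bottleneck)
        self.cross_attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, patch_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim); text_tokens: (batch, num_text_tokens, dim)
        q = self.down_v(patch_tokens)
        kv = self.down_t(text_tokens)
        fused, _ = self.cross_attn(q, kv, kv)       # inject text semantics into local visual tokens
        return patch_tokens + self.up(fused)


if __name__ == "__main__":
    B, T, P, D = 2, 8, 49, 512                      # batch, frames, patches per frame, feature dim
    gta = GlobalTemporalAdapter(D)
    lma = LocalMultimodalAdapter(D)
    cls_out = gta(torch.randn(B, T, D))             # -> (2, 8, 512)
    patch_out = lma(torch.randn(B, P, D), torch.randn(B, 5, D))  # -> (2, 49, 512)
    print(cls_out.shape, patch_out.shape)
```

Both adapters use a down-project/up-project bottleneck with a residual connection, which is a common way to keep the number of trainable parameters small while leaving the pretrained CLIP weights frozen; how the paper actually realizes its adapters and the text-guided prototype module may differ.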
External IDs: dblp:journals/pr/XingZXWDLWL26