MA-FSAR: Multimodal Adaptation of CLIP for few-shot action recognition

Published: 01 Jan 2026, Last Modified: 15 Oct 2025 | Pattern Recognit. 2026 | CC BY-SA 4.0
Abstract: Highlights
• Improving CLIP’s action-related temporal and semantic representations via parameter-efficient fine-tuning.
• Global Temporal Adaptation captures global motion cues efficiently through the class token.
• Local Multimodal Adaptation fuses visual and FSAR-specific text tokens to model local dynamics.
• A text-guided module enriches temporal and semantic representations of video prototypes.