Abstract: Highlights
• Improving CLIP's action-related temporal and semantic representations via parameter-efficient fine-tuning.
• Global Temporal Adaptation captures global motion cues efficiently through the class token.
• Local Multimodal Adaptation fuses visual and FSAR-specific text tokens to model local dynamics.
• A text-guided module enriches the temporal and semantic representations of video prototypes.
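The highlights describe two lightweight adapters on top of a frozen CLIP backbone: a temporal one acting on per-frame class tokens and a multimodal one fusing local visual tokens with text tokens. The sketch below is a minimal, hypothetical illustration of that general idea, not the authors' implementation; all module names, bottleneck dimensions, and design details are assumptions, and it presumes per-frame CLIP features have already been extracted.

```python
# Hypothetical sketch of the two adapter ideas from the highlights (illustrative only).
import torch
import torch.nn as nn


class GlobalTemporalAdapter(nn.Module):
    """Lightweight temporal self-attention over per-frame class tokens (assumed design)."""

    def __init__(self, dim: int = 512, bottleneck: int = 64, heads: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)      # parameter-efficient bottleneck
        self.temporal_attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, cls_tokens: torch.Tensor) -> torch.Tensor:
        # cls_tokens: (batch, num_frames, dim) class tokens from a frozen CLIP image encoder
        x = self.down(cls_tokens)
        x, _ = self.temporal_attn(x, x, x)          # exchange global motion cues across frames
        return cls_tokens + self.up(x)              # residual keeps the frozen CLIP features intact


class LocalMultimodalAdapter(nn.Module):
    """Cross-attention from local patch tokens to task-specific text tokens (assumed design)."""

    def __init__(self, dim: int = 512, bottleneck: int = 64, heads: int = 4):
        super().__init__()
        self.down_v = nn.Linear(dim, bottleneck)
        self.down_t = nn.Linear(dim, bottleneck)
        self.cross_attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, patch_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim); text_tokens: (batch, num_text_tokens, dim)
        q = self.down_v(patch_tokens)
        kv = self.down_t(text_tokens)
        fused, _ = self.cross_attn(q, kv, kv)       # inject text semantics into local visual tokens
        return patch_tokens + self.up(fused)


if __name__ == "__main__":
    B, T, P, D = 2, 8, 49, 512                      # batch, frames, patches per frame, feature dim
    gta = GlobalTemporalAdapter(D)
    lma = LocalMultimodalAdapter(D)
    cls_out = gta(torch.randn(B, T, D))             # -> (2, 8, 512)
    patch_out = lma(torch.randn(B, P, D), torch.randn(B, 5, D))  # -> (2, 49, 512)
    print(cls_out.shape, patch_out.shape)
```

Both adapters use a down-project/up-project bottleneck with a residual connection, which is a common way to keep the number of trainable parameters small while leaving the pretrained CLIP weights frozen; how the paper actually realizes its adapters and the text-guided prototype module may differ.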
External IDs: dblp:journals/pr/XingZXWDLWL26