Uni-directional Blending: Learning Robust Representations for Few-shot Action Recognition with Frame-level Ambiguities

ICLR 2026 Conference Submission 16750 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, License: CC BY 4.0
Keywords: Few-shot Learning, Prototype Learning, Vision–Language Alignment, Frame-level Ambiguity, Uni-directional Blending, Learnable Text Query, Semantic Alignment
TL;DR: We propose uni-directional blending with learnable text queries and a semantic bridging loss to learn robust representations for few-shot action recognition, mitigating frame-level ambiguities and achieving a 6.5% top-1 accuracy gain on HMDB51.
Abstract: Leveraging vision-language models (VLMs) for few-shot action recognition has shown promising results, yet direct image-text alignment methods, such as CLIP, encounter significant challenges in video domains due to frame-level ambiguities. Videos frequently include irrelevant and redundant frames, leading to intra-class ambiguity from non-essential content within the same action and inter-class ambiguity from visually overlapping elements across classes. These ambiguities hinder the learning of distinctive prototypes and robust semantic representations. To overcome this, we introduce Uni-FSAR, a novel framework that employs uni-directional blending to selectively integrate relevant frames, preventing contamination of prototypes by irrelevant visual noise. Additionally, a learnable text query (LTQ) bridges the semantic gap between visual features and class labels, enhancing representation alignment. Furthermore, our LTQ-based Semantic Bridging Loss promotes focus on informative frames through similarity-based gradient propagation, mitigating inter-class overlap and fostering more generalizable representations. Extensive experiments, including cross-dataset evaluations, demonstrate that Uni-FSAR achieves superior robustness in handling frame-level ambiguities compared to prior work. Quantitative and qualitative results show that our method outperforms the state of the art by an average of 2.34% across benchmarks, with a notable 6.5% top-1 accuracy gain on HMDB51, where ambiguities are most pronounced.
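For intuition, the sketch below shows one plausible reading of the two ideas named in the abstract: frame features are weighted by their similarity to a learnable text query (LTQ) and blended one way into a class prototype, and a simple cosine-alignment term stands in for the semantic bridging loss. This is a minimal conceptual sketch, not the paper's method: the module names, feature shapes, temperature value, and initialization are illustrative assumptions, since the abstract does not specify the architecture or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UniDirectionalBlending(nn.Module):
    """Hypothetical sketch: weight frame features by similarity to a
    learnable text query (LTQ) and blend them into a prototype.
    Information flows only frames -> prototype, so irrelevant frames
    are down-weighted rather than contaminating the prototype."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Learnable text query; in practice it might be initialized from a
        # CLIP text embedding of the class label (assumption).
        self.ltq = nn.Parameter(torch.randn(dim) * 0.02)
        self.proj = nn.Linear(dim, dim)

    def forward(self, frames: torch.Tensor):
        # frames: (T, D) frame-level visual features for one video.
        q = F.normalize(self.ltq, dim=-1)                     # (D,)
        f = F.normalize(self.proj(frames), dim=-1)            # (T, D)
        weights = F.softmax(f @ q / 0.07, dim=0)              # (T,) frame relevance
        prototype = (weights.unsqueeze(-1) * frames).sum(0)   # (D,) blended prototype
        return prototype, weights


def semantic_bridging_loss(prototype: torch.Tensor, text_embed: torch.Tensor):
    """Hypothetical stand-in for the LTQ-based semantic bridging loss:
    pull the blended prototype toward its class text embedding."""
    p = F.normalize(prototype, dim=-1)
    t = F.normalize(text_embed, dim=-1)
    return 1.0 - (p * t).sum(-1)
```

Because the relevance weights come from the LTQ-to-frame similarity, gradients of the loss flow back through those weights, which is one way a "similarity-based gradient propagation" toward informative frames could be realized.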
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16750