REAR: Retrieval-Augmented Egocentric Action Recognition

ICLR 2026 Conference Submission 22026 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Egocentric Action Recognition, Retrieval-Augmented, Long-tail Recognition
Abstract: Egocentric Action Recognition (EAR) aims to identify fine-grained actions and interacted objects from first-person videos, forming a core task in egocentric video understanding. Despite recent progress, EAR remains challenged by limited data scale, annotation quality, and long-tailed class distributions. To address these issues, we propose REAR, a Retrieval-augmented framework for EAR that leverages external third-person (exocentric) videos as auxiliary knowledge---without requiring synchronized ego-exo pairs. REAR adopts a dual-branch architecture: one branch extracts egocentric representations, while the other retrieves semantically relevant exocentric features. These are fused via a cross-view integration module that performs staged refinement and attention-based alignment. To mitigate class imbalance, a class-adaptive selector dynamically adjusts retrieval depth based on class frequency, and independent classifiers are trained with logit-adjusted cross-entropy. Extensive experiments across three benchmarks demonstrate that REAR achieves state-of-the-art performance, with significant gains in object recognition and tail-class accuracy. Code will be released upon acceptance.
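The abstract mentions two concrete training-time mechanisms: a class-adaptive selector that sets retrieval depth from class frequency, and classifiers trained with logit-adjusted cross-entropy. The sketch below illustrates both in PyTorch under stated assumptions; it is not the authors' released code, and the function names, the linear frequency-to-depth rule, and the temperature `tau` are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn.functional as F


def adaptive_retrieval_depth(class_counts, k_min=2, k_max=16):
    """Hypothetical class-adaptive selector: rarer classes retrieve more
    exocentric neighbors. Maps normalized log-frequency to [k_min, k_max]."""
    log_freq = torch.log(class_counts.float() + 1.0)
    norm = (log_freq - log_freq.min()) / (log_freq.max() - log_freq.min() + 1e-12)
    # Head classes (norm ~ 1) get k_min, tail classes (norm ~ 0) get k_max.
    depth = k_max - norm * (k_max - k_min)
    return depth.round().long()


def logit_adjusted_cross_entropy(logits, targets, class_counts, tau=1.0):
    """Cross-entropy with logits shifted by tau * log(class prior), a standard
    logit-adjustment recipe for long-tailed classification."""
    prior = class_counts.float() / class_counts.sum()
    adjusted = logits + tau * torch.log(prior + 1e-12)
    return F.cross_entropy(adjusted, targets)


if __name__ == "__main__":
    num_classes = 97                                   # hypothetical verb vocabulary size
    class_counts = torch.randint(1, 1000, (num_classes,))  # long-tailed label counts
    print(adaptive_retrieval_depth(class_counts)[:5])  # per-class retrieval depths

    logits = torch.randn(8, num_classes)               # batch of 8 clip-level logits
    targets = torch.randint(0, num_classes, (8,))
    loss = logit_adjusted_cross_entropy(logits, targets, class_counts, tau=1.0)
    print(loss.item())
```

Both pieces only need per-class training frequencies, which is consistent with the abstract's claim that no synchronized ego-exo pairs are required.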
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 22026