Egocentric Video Understanding through Latent Action Representations

15 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: latent action representation, action prediction, action understanding, video understanding
TL;DR: A novel framework that integrates latent action representations, object detection, and vision embedding fusion to improve egocentric action tasks, especially in the verb aspect.
Abstract: We study action understanding in egocentric videos, a task crucial for intelligent systems that interact with dynamic environments, such as assistive robots and augmented reality interfaces. This task requires capturing fine-grained, temporally localized interactions, which we call the action dynamics. Existing approaches often struggle to jointly model the interplay between object appearance and motion cues, limiting their ability to anticipate future actions. To address this, we propose LAF (Latent Action Fusion), a multi-modal Transformer-based framework for egocentric action anticipation and recognition. Our method extracts compact and interpretable latent action tokens from sequential video frames using a latent action model, built on the VQ-VAE paradigm and an action-conditioned frame reconstruction method that measures action dynamics. We then fuse these latent action tokens with embeddings from pretrained vision encoders and object detectors. The resulting multi-modal representation encodes object, interaction, spatial, and temporal information, enabling the modeling of complex temporal dynamics and improving verb-level reasoning. Experiments on large-scale egocentric video datasets demonstrate that LAF is useful for action recognition and significantly improves action anticipation (Top-5 mAP: N 24.11 → 31.02; N–V 10.62 → 14.34), highlighting the benefits of integrating latent action representations with multi-modal embeddings for precise understanding of the verb aspect.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5725
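
The two steps the abstract names, quantizing frame-to-frame features into latent action tokens (the VQ-VAE paradigm) and fusing those tokens with vision and object embeddings, can be illustrated with a toy sketch. This is not the authors' implementation: the function names `quantize` and `fuse`, the tiny codebook, the feature values, and the use of plain concatenation as the fusion operator are all assumptions made for illustration, and the abstract does not specify any of them.

```python
def quantize(features, codebook):
    """Nearest-neighbour vector quantization, the core lookup in a VQ-VAE.

    Returns the index of the closest codebook entry for each feature
    vector, together with the selected codebook vectors (the "tokens").
    """
    indices = []
    for f in features:
        # Squared Euclidean distance from this feature to every code.
        dists = [sum((a - b) ** 2 for a, b in zip(f, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices, [codebook[i] for i in indices]


def fuse(latent_tokens, vision_emb, object_emb):
    """Assumed fusion: concatenate the three embeddings per time step."""
    return [t + v + o for t, v, o in zip(latent_tokens, vision_emb, object_emb)]


# Hypothetical toy data: 3 codes of dimension 2, 3 time steps.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
motion = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]   # stand-in motion features
vision = [[0.5], [0.6], [0.7]]                    # stand-in vision embeddings
objects = [[1.0], [0.0], [1.0]]                   # stand-in object embeddings

indices, tokens = quantize(motion, codebook)
fused = fuse(tokens, vision, objects)
print(indices)    # [1, 0, 2]
print(fused[0])   # [1.0, 1.0, 0.5, 1.0]
```

In a real system the fused sequence would be passed to a Transformer, and the codebook would be learned jointly with the frame-reconstruction objective rather than fixed by hand.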