SAGE: A Synchronized Action and Gaze Estimation Framework for Comprehensive Human Behavior Analysis

ICLR 2026 Conference Submission13803 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Human-Object Interaction, Action Recognition, Action Anticipation, Gaze Prediction, Gaze Anticipation
TL;DR: SAGE is a transformer-based model that jointly anticipates and recognizes human actions, object interactions, and gaze in first- and third-person views, outperforming specialized models and introducing the third-person Exo-Cook benchmark.
Abstract: Human-object interactions, gaze patterns, and their anticipation are intricately linked, providing valuable insights into cognitive processes, intentions, and behavior. This paper presents SAGE (Synchronized Action and Gaze Estimation), a unified framework that integrates simultaneous recognition and anticipation of both human-object interaction and human gaze into a single end-to-end trainable model. Our approach leverages a transformer-based architecture and incorporates gaze data into spatiotemporal attention mechanisms to jointly predict current and future human actions and gaze behavior. We explore this bidirectional relationship between gaze and actions in scenarios requiring either a close-up, detailed view (first-person) or a wider, more contextual view (third-person), making our framework versatile for various applications. Additionally, because datasets for comprehensive analysis of both human-object interactions and gaze in exocentric videos are lacking, we establish a new benchmark, Exo-Cook, to facilitate further research in this domain. Our experiments on three benchmark datasets (VidHOI, EGTEA Gaze+, and Exo-Cook) demonstrate that exploiting the synergy between gaze and actions in current and future frames compares favorably with, and even outperforms, individual task-specialized state-of-the-art models. By offering a holistic understanding of human actions and attention, our work paves the way for more natural and intuitive human-machine interaction and opens new avenues for applications in cognitive rehabilitation and behavior analysis.
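One way to read "incorporates gaze data into spatiotemporal attention mechanisms" is as an additive gaze bias on the attention logits, so that tokens at gazed spatiotemporal positions receive more weight. The sketch below is an illustrative assumption, not the authors' implementation: the function name, the log-saliency bias, and the `alpha` scaling are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gaze_biased_attention(q, k, v, gaze_prior, alpha=1.0):
    """Scaled dot-product attention with an additive gaze bias (a sketch).

    q, k, v: (T, d) token features over flattened spatiotemporal positions.
    gaze_prior: (T,) non-negative saliency per key position, e.g. a gaze
    heatmap flattened over space-time; larger values pull attention toward
    the gazed tokens. alpha controls how strongly gaze steers attention.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                        # (T, T) similarities
    logits = logits + alpha * np.log(gaze_prior + 1e-6)  # bias keys by gaze
    return softmax(logits, axis=-1) @ v                  # weighted values
```

With a near-uniform `gaze_prior` this reduces to standard attention; a strongly peaked prior makes every query attend almost exclusively to the gazed position, which is the intended coupling between attention (in the model) and attention (of the human).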
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13803