Fixation-Driven Time-Aware 3D Human Motion Forecasting in Indoor Scenes

ICLR 2026 Conference Submission 12711 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Human Motion Forecasting
TL;DR: We propose a fixation-driven, time-aware framework that uses gaze fixations to localize interaction targets and predict variable-length 3D human motion in indoor scenes, improving accuracy and robustness to segmentation errors.
Abstract: Forecasting human motion in indoor scenes is crucial for collaborative robotics and embodied AI. While prior approaches have incorporated gaze implicitly or used it only to rank segmented objects, we argue that gaze, and fixations in particular, offers a more intentional and spatially precise signal for predicting human intent. In this work, we introduce a fixation-driven, time-aware framework for 3D human motion forecasting that explicitly supervises a gaze network to distinguish fixations from saccades, and uses fixation-weighted vectors not only to rank candidate objects but also to localize precise interaction points, improving robustness to segmentation errors. Our contribution further includes a duration prediction module that generates variable-length motion sequences, adapting to the spatial and temporal demands of the task. We evaluate our approach on the GIMO and GTA-IM datasets, showing more accurate predictions, particularly in challenging scenes with small or merged objects, and better handling of varying interaction durations through variable-length motion generation. Our code will be made publicly available.
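As an illustration only (the listing itself provides no code or equations), a minimal sketch of what the fixation-weighted object ranking described in the abstract might look like. All function names, array shapes, and the cosine-alignment scoring here are assumptions, not the authors' implementation; the per-frame fixation probabilities would come from the supervised gaze network:

```python
import numpy as np

def rank_objects_by_fixation(gaze_dirs, fixation_probs, object_centers, eye_pos):
    """Rank candidate objects by fixation-weighted gaze alignment (illustrative).

    gaze_dirs:      (T, 3) unit gaze direction per frame.
    fixation_probs: (T,)   per-frame probability the frame is a fixation
                           (vs. a saccade), e.g. from a gaze network.
    object_centers: (K, 3) centers of K candidate objects.
    eye_pos:        (3,)   eye position (assumed static for this sketch).
    Returns (order, scores): object indices sorted by descending score,
    and the raw fixation-weighted scores.
    """
    # Unit vectors from the eye toward each candidate object: (K, 3).
    to_obj = object_centers - eye_pos
    to_obj = to_obj / np.linalg.norm(to_obj, axis=1, keepdims=True)

    # Cosine alignment between gaze and each object direction: (T, K).
    cos = gaze_dirs @ to_obj.T

    # Down-weight saccade frames; sum alignment over time: (K,).
    scores = (fixation_probs[:, None] * cos).sum(axis=0)
    return np.argsort(-scores), scores
```

Under this scheme, frames classified as saccades contribute little to an object's score, so a briefly glanced-at distractor is outranked by the fixated interaction target even if both fall near the gaze ray at some point.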
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12711