From Gaze to Action: Leveraging Affordance Grounding for Human Intention Understanding

Published: 07 May 2025, Last Modified: 07 May 2025
Venue: ICRA Workshop Human-Centered Robot Learning
License: CC BY 4.0
Workshop Statement: Our work explores gaze-affordance grounding to decode human intention, with the aim of improving robot perception and planning in assistive robotics. Given the rapid progress in foundation models (LLMs, VLMs, VLAs) and the increasing availability of egocentric human recordings, we believe there is a strong case for learning directly from natural human behaviour, i.e. natural interaction with the environment, encompassing both visual fixations and human-object interaction. This can help both humans and robots understand human behaviour better, which in turn can be leveraged for more efficient human-robot interaction; for example, our current goal is to leverage this improved understanding of human gaze for Zero-UI, gaze-driven assistive robotics.
Keywords: Gaze tracking, affordance grounding
TL;DR: Can we combine gaze tracking and affordance grounding to decode human intention? Here we describe our initial methodology.
Abstract: We present an early-stage investigation into gaze-driven intention recognition for assistive robotics, with the goal of overcoming persistent challenges in the Midas Touch Problem—distinguishing intentional gaze from inspection gaze—and Intention Decoding—inferring the user’s goal from gaze. Our prior work addressed these challenges separately using supervised classifiers and action-grammars or LLMs, but limitations remain in generalizability and ambiguity resolution. Here, we propose a unified approach inspired by affordance grounding—the visual identification of object regions responsible for specific actions. We study whether humans fixate on affordance regions prior to interaction and whether this signal can be used to infer intention in real-world scenarios. Using large egocentric datasets, we analyze gaze-object-action relationships across time. We benchmark automated annotations against human-labeled data, assess the applicability of existing affordance models (e.g., LOCATE, OOAL) in egocentric settings, and explore models' capacity to resolve ambiguous intentions. Our work offers insights into integrating gaze, affordance, and language models for more robust human-in-the-wild intention decoding.
Submission Number: 23
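
As a rough illustration of the pre-interaction gaze-affordance analysis described in the abstract, the sketch below checks whether gaze fixations in a short window before an annotated interaction onset land inside a thresholded affordance heatmap (e.g., the kind of per-pixel map produced by LOCATE/OOAL-style models). The fixation format, window length, and threshold are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' implementation): fraction of pre-interaction
# fixations that fall inside a high-affordance region of a predicted heatmap.
import numpy as np


def fixation_hits_affordance(fixations, heatmap, onset_t, window_s=1.0, thresh=0.5):
    """fixations: iterable of (t, x, y) with x, y normalized to [0, 1];
    heatmap: HxW affordance map with values in [0, 1] (assumed model output);
    onset_t: annotated interaction onset time in seconds."""
    h, w = heatmap.shape
    hits, total = 0, 0
    for t, x, y in fixations:
        # Only consider fixations in the window just before the interaction.
        if not (onset_t - window_s <= t < onset_t):
            continue
        total += 1
        row = min(int(y * h), h - 1)
        col = min(int(x * w), w - 1)
        if heatmap[row, col] >= thresh:
            hits += 1
    return hits / total if total else float("nan")


if __name__ == "__main__":
    # Synthetic example: a square "affordance region" and three fixations.
    heatmap = np.zeros((64, 64))
    heatmap[20:40, 20:40] = 1.0
    fixations = [(0.2, 0.5, 0.5), (0.6, 0.45, 0.4), (1.5, 0.9, 0.9)]
    print(fixation_hits_affordance(fixations, heatmap, onset_t=1.0))  # -> 1.0
```

Aggregating this hit rate across clips and affordance classes would give one simple way to test whether fixations concentrate on affordance regions before interaction; the actual analysis in the paper may differ.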