Towards Mega-pixel Embodied Vision: Foveated Ego-Views Predict Long-Horizon Contextual Intent in Dexterous Manipulation
Keywords: Eye Tracking, Dexterous Manipulation
Abstract: Manipulating objects through physical contact on a moving robot demands perception that simultaneously captures wide spatio-temporal context and task-critical visuomotor detail.
Biological vision resolves this tension via foveation: high-acuity sensing at the point of fixation, complemented by action-sensitive peripheral vision, an architecture that has evolved convergently across species.
By contrast, contemporary robot systems routinely downsample high-resolution (e.g., 4K) camera streams, forcing a trade-off between the field of view and the level of detail that degrades decision-making.
We present a novel data collection protocol that jointly records high-resolution RGB video, human foveation (via gaze/attention signals), and accurate hand poses during manipulation. Our analysis shows that foveated active perception from human subjects consistently predicts future task landmarks hundreds of milliseconds in advance, providing long-horizon contextual cues for action planning. These findings suggest that today's wide-field passive vision systems will be superseded by active perception that moves towards mega-pixel, foveated architectures.
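A minimal sketch of the kind of gaze-lead analysis described above, assuming time-aligned gaze and hand trajectories in pixel coordinates; the function name, signature, and 50-pixel fixation threshold are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def gaze_lead_time_ms(gaze_xy, hand_xy, landmark_xy, t_ms,
                      radius_px: float = 50.0) -> float | None:
    """Estimate how far ahead gaze fixates a task landmark before the
    hand arrives. Returns the lead time in milliseconds, or None if the
    gaze or the hand never enters the landmark's neighborhood.

    gaze_xy, hand_xy : (T, 2) pixel trajectories, time-aligned.
    landmark_xy      : (2,) pixel location of the task landmark.
    t_ms             : (T,) timestamps in milliseconds.
    radius_px        : distance threshold counting as "on the landmark".
    """
    gaze_d = np.linalg.norm(gaze_xy - landmark_xy, axis=1)
    hand_d = np.linalg.norm(hand_xy - landmark_xy, axis=1)
    gaze_hits = np.flatnonzero(gaze_d < radius_px)
    hand_hits = np.flatnonzero(hand_d < radius_px)
    if gaze_hits.size == 0 or hand_hits.size == 0:
        return None
    # Lead time = first hand arrival minus first gaze fixation;
    # a positive value means the gaze anticipated the landmark.
    return float(t_ms[hand_hits[0]] - t_ms[gaze_hits[0]])
```

Aggregating this quantity over landmarks and trials would yield a distribution of lead times whose consistently positive values correspond to the hundreds-of-milliseconds anticipation reported in the abstract.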
Lightning Talk Video: mp4
Submission Number: 40