Towards Mega-pixel Embodied Vision: Foveated Ego-Views Predict Long-Horizon Contextual Intent in Dexterous Manipulation
Keywords: Eye Tracking, Dexterous Manipulation
Abstract: Manipulating objects through physical contact on a moving robot demands perception that simultaneously captures wide spatio-temporal context and task-critical visuomotor detail.
Biological vision resolves this tension via foveation: high-acuity sensing at the point of fixation, complemented by action-sensitive peripheral vision, an architecture that has evolved convergently across species.
By contrast, contemporary robot systems routinely downsample high-resolution (e.g., 4K) camera streams, forcing a trade-off between the field of view and the level of detail that degrades decision-making.
We present a novel data collection protocol that jointly records high-resolution RGB video, human foveation (via gaze/attention signals), and accurate hand poses during manipulation. Our analysis shows that foveated active perception from human subjects consistently predicts future task landmarks hundreds of milliseconds in advance, providing long-horizon contextual cues for action planning. These findings suggest that today's wide-field passive vision systems will be superseded by active perception that moves towards mega-pixel, foveated architectures.
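A minimal sketch of the kind of gaze-lead analysis described above, assuming time-aligned gaze and hand trajectories in pixel coordinates; the function name, signature, and 50-pixel fixation threshold are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def gaze_lead_time_ms(gaze_xy, hand_xy, landmark_xy, t_ms,
                      radius_px: float = 50.0) -> float | None:
    """Estimate how far ahead gaze fixates a task landmark before the
    hand arrives. Returns the lead time in milliseconds, or None if the
    gaze or the hand never enters the landmark's neighborhood.

    gaze_xy, hand_xy : (T, 2) pixel trajectories, time-aligned.
    landmark_xy      : (2,) pixel location of the task landmark.
    t_ms             : (T,) timestamps in milliseconds.
    radius_px        : distance threshold counting as "on the landmark".
    """
    gaze_d = np.linalg.norm(gaze_xy - landmark_xy, axis=1)
    hand_d = np.linalg.norm(hand_xy - landmark_xy, axis=1)
    gaze_hits = np.flatnonzero(gaze_d < radius_px)
    hand_hits = np.flatnonzero(hand_d < radius_px)
    if gaze_hits.size == 0 or hand_hits.size == 0:
        return None
    # Lead time = first hand arrival minus first gaze fixation;
    # a positive value means the gaze anticipated the landmark.
    return float(t_ms[hand_hits[0]] - t_ms[gaze_hits[0]])
```

Aggregating this quantity over landmarks and trials would yield a distribution of lead times whose consistently positive values correspond to the hundreds-of-milliseconds anticipation reported in the abstract.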
Lightning Talk Video: mp4
Submission Number: 40