EgoPAT3Dv2: Predicting 3D Action Target from 2D Egocentric Vision for Human-Robot Interaction

Published: 15 May 2026 · Last Modified: 12 Nov 2025 · ICRA · CC BY-SA 4.0
Abstract: A robot’s ability to anticipate the 3D target location of a hand’s movement from egocentric videos can greatly improve safety and efficiency in human-robot interaction (HRI). While previous research has primarily focused on semantic action classification or 2D target region prediction, we argue that predicting the action target’s 3D coordinates could enable more versatile downstream robotics tasks, especially given the growing prevalence of headset devices. To advance this goal, our study expands EgoPAT3D, the only dataset dedicated to egocentric 3D action target prediction, increasing its size and diversity to improve generalization. We also substantially improve the baseline algorithm by incorporating a large pre-trained model and human prior knowledge. Notably, our new algorithm achieves superior prediction performance using only RGB images, removing the previous dependency on 3D point clouds and IMU input. Furthermore, we deploy this enhanced baseline on a real-world robotic platform to demonstrate its practical utility in simple HRI tasks, showcasing the real-world applicability of our approach and inspiring broader use cases involving egocentric vision. All code and data are open-sourced and available on the project website.