Pre-training Auto-regressive Robotic Models with 4D Representations

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We introduce a novel robotics pre-training approach that leverages low-level 4D representations, obtained by tracking 3D points over time in videos.
Abstract: Foundation models pre-trained on massive unlabeled datasets have revolutionized natural language and computer vision, exhibiting remarkable generalization capabilities and highlighting the importance of pre-training. Yet, efforts in robotics have struggled to achieve similar success, limited either by the need for costly robotic annotations or by the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an **A**uto-regressive **R**obotic **M**odel that leverages low-level **4**D **R**epresentations learned from human video data to yield a better pre-trained robotic model. Specifically, we focus on 3D point tracking representations derived from videos by lifting 2D representations into 3D space via monocular depth estimation across time. These 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, enabling efficient transfer learning from human video data to low-level robotic control. Our experiments show that ARM4R can transfer efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.
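The lifting step described in the abstract can be made concrete with a short sketch. This is a minimal illustration, not the authors' released implementation: it assumes known pinhole camera intrinsics `K`, precomputed 2D point tracks, and per-frame monocular depth maps, and all function and variable names here are hypothetical.

```python
import numpy as np

def lift_tracks_to_4d(tracks_2d, depth_maps, K):
    """Lift 2D point tracks into 3D across time, giving a 4D representation.

    tracks_2d:  (T, N, 2) pixel coordinates (u, v) of N tracked points per frame.
    depth_maps: (T, H, W) per-frame monocular depth estimates
                (e.g. from an off-the-shelf depth model).
    K:          (3, 3) pinhole camera intrinsics.
    Returns:    (T, N, 3) 3D point trajectories in camera coordinates.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    T, N, _ = tracks_2d.shape
    points_3d = np.empty((T, N, 3))
    for t in range(T):
        u, v = tracks_2d[t, :, 0], tracks_2d[t, :, 1]
        # Sample depth at each tracked pixel (nearest neighbor for simplicity).
        vi = np.clip(np.round(v).astype(int), 0, depth_maps.shape[1] - 1)
        ui = np.clip(np.round(u).astype(int), 0, depth_maps.shape[2] - 1)
        z = depth_maps[t, vi, ui]
        # Back-project with the pinhole model: x = (u - cx) * z / fx, etc.
        points_3d[t, :, 0] = (u - cx) * z / fx
        points_3d[t, :, 1] = (v - cy) * z / fy
        points_3d[t, :, 2] = z
    return points_3d
```

Stacking these per-frame 3D points over time yields the kind of point trajectories that, per the abstract, serve as the low-level 4D representations for pre-training.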
Lay Summary: Modern AI models have made big advances in understanding language and images by learning from huge amounts of data found online. But in robotics, we haven't seen the same kind of success, mostly because training robots usually requires expensive, detailed data, and it is difficult to capture the detailed dynamics of the real world. Our research introduces a new approach called ARM4R that helps robots learn more effectively from human videos. Instead of relying on costly robot-specific data, we use regular video footage of people and track how their body points move in 3D over time. This creates a kind of "4D" representation (3D in space + time) that is general enough to extend to robots as well. We show that our method helps robots perform better across a variety of tasks and setups.
Link To Code: https://github.com/Dantong88/arm4r
Primary Area: Applications->Robotics
Keywords: Auto-regressive Robotic Models, Pre-training, 4D Representations
Submission Number: 2567