Imitation from Videos: Monocular 3D Motion Estimation for Agile Quadruped Locomotion

Published: 08 May 2026, Last Modified: 08 May 2026, Venue: ICRA 2026 Workshop RL4IL Poster, Readers: Everyone, License: CC BY 4.0
Keywords: Quadrupedal robot locomotion, imitation learning, monocular video, 3D pose estimation, graph neural network
Abstract: Agile, explosive motions are extremely challenging for quadrupedal robots because rewards for aggressive behaviors are complex to design and tune. Motion capture can provide 3D reference motions, but it is costly and requires specialized equipment and annotation. Video-based learning is cheaper, yet monocular videos offer only 2D pixels; during fast actions, appearance changes cause joint-tracking failures and discontinuous trajectories that hinder locomotion learning. We present a video-to-motion framework that addresses these problems. Robust 2D pose estimation and tracking build an undirected skeleton graph and fuse joint observations using a Kalman filter. A Spatial-Temporal Graph Convolutional Network then reconstructs 3D joint trajectories, aggregating spatial pose features via graph convolution and temporal dynamics via dilated temporal convolution. The reconstructed motions are retargeted to the robot's joint space and learned with generative imitation learning. Deployed on a quadruped, the robot acquires gallop, tripod, bipedal, and backflip motions, reaching up to 3.5 m/s while tracking commands. Supplementary video: https://youtu.be/SGf0Nkx8t9A?si=cI098unO6MZ2Kpfv
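Illustrative sketch (not from the paper): the abstract mentions fusing joint observations with a Kalman filter but gives no state model or noise settings. The Python example below shows one plausible form of that step for a single joint, assuming a constant-velocity model in pixel space and NaN rows for frames where the detector lost the joint. The function name `kalman_filter_track` and the parameters `dt`, `process_var`, and `meas_var` are hypothetical choices made for this example.

```python
import numpy as np


def kalman_filter_track(observations, dt=1.0, process_var=1e-2, meas_var=1.0):
    """Filter one joint's 2D pixel track with a constant-velocity Kalman filter.

    observations: array of shape (T, 2); a row containing NaN marks a frame
    where the joint was not detected. Returns filtered positions, shape (T, 2).
    All model choices here are assumptions, not the authors' settings.
    """
    obs = np.asarray(observations, dtype=float)
    valid = ~np.isnan(obs).any(axis=1)
    if not valid.any():
        raise ValueError("track contains no valid observations")

    # State vector: [x, y, vx, vy]; only the position is observed.
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    Q = process_var * np.eye(4)   # process noise (assumed isotropic)
    R = meas_var * np.eye(2)      # measurement noise (assumed isotropic)

    # Initialize at the first valid detection with zero velocity.
    first = int(np.argmax(valid))
    x = np.array([obs[first, 0], obs[first, 1], 0.0, 0.0])
    P = np.eye(4)

    filtered = np.empty_like(obs)
    for t in range(obs.shape[0]):
        # Predict step: propagate state and covariance one frame forward.
        x = F @ x
        P = F @ P @ F.T + Q
        if valid[t]:
            # Update step: correct the prediction with the detection.
            innovation = obs[t] - H @ x
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            x = x + K @ innovation
            P = (np.eye(4) - K @ H) @ P
        # On missed frames the prediction alone bridges the gap.
        filtered[t] = x[:2]
    return filtered
```

In this sketch a missed detection is bridged by the motion model alone, which is one simple way to avoid the discontinuous trajectories the abstract describes. A full pipeline would run such a filter for every joint and could share information along the skeleton graph.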
Submission Number: 23