CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

18 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Robotic manipulation, learning from video, Unsupervised Learning, Robot Learning
TL;DR: CoMo learns better latent motion from video data as pseudo-labels to scale robot learning.
Abstract: Unsupervised learning of latent motion from Internet videos is crucial for building generalist robots. However, existing discrete methods suffer from information loss and struggle with complex, fine-grained dynamics. We propose CoMo, which learns more precise continuous latent motion from Internet-scale videos. CoMo employs an early temporal feature difference mechanism to prevent shortcut learning and suppress static appearance noise. Furthermore, guided by the information bottleneck principle, we constrain the latent motion dimensionality to balance retaining sufficient action-relevant information against admitting action-irrelevant background noise. Additionally, we introduce two effective metrics for evaluating and analyzing latent motion more directly and affordably, and for guiding the development of motion learning methods: (i) the MSE of action prediction, and (ii) the cosine similarity between past-to-current and future-to-current motion embeddings. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate effective pseudo actions for unseen videos. The shared continuous distribution of robot actions and video latent motion also directly benefits the joint learning of a unified policy. Extensive simulated and real-world experiments show that policies co-trained with CoMo pseudo actions achieve superior performance with both diffusion and autoregressive architectures.
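The two evaluation metrics named in the abstract can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: the array names and shapes are assumptions, and the paper's exact definitions (e.g., which direction of cosine similarity indicates better-structured motion) are not specified in the abstract.

```python
import numpy as np

def action_mse(pred_actions: np.ndarray, true_actions: np.ndarray) -> float:
    """Metric (i): mean squared error between actions predicted from
    latent motion and ground-truth robot actions (lower is better).
    Shapes are assumed to be (batch, action_dim)."""
    return float(np.mean((pred_actions - true_actions) ** 2))

def motion_cosine_similarity(past_to_current: np.ndarray,
                             future_to_current: np.ndarray) -> float:
    """Metric (ii): mean cosine similarity between past-to-current and
    future-to-current motion embeddings, assumed shape (batch, latent_dim).
    How the resulting value is interpreted depends on the paper's setup."""
    a = past_to_current / np.linalg.norm(past_to_current, axis=-1, keepdims=True)
    b = future_to_current / np.linalg.norm(future_to_current, axis=-1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=-1)))
```

Both metrics operate directly on embeddings and predicted actions, so they can be computed without rolling out a policy, which is what makes them comparatively cheap probes of motion quality.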
Primary Area: applications to robotics, autonomy, planning
Submission Number: 12295