Keywords: Robotic Manipulation, Learning from Video, Unsupervised Learning, Robot Learning
TL;DR: CoMo learns better latent motion from video data as pseudo-labels to scale robot learning.
Abstract: Unsupervised learning of latent motion from Internet videos is crucial for building
generalist robots. However, existing discrete methods suffer from information
loss and struggle with complex and fine-grained dynamics. We propose CoMo,
which learns more precise continuous latent motion from Internet-scale
videos. CoMo employs an early temporal feature difference mechanism to prevent
shortcut learning and suppress static appearance noise. Furthermore, guided by the
information bottleneck principle, we constrain the latent motion dimensionality
to achieve a balance between retaining sufficient action-relevant information and
minimizing the inclusion of action-irrelevant background noise. Additionally, we
introduce two effective metrics for more directly and affordably evaluating
and analyzing latent motion and for guiding the development of motion learning methods: (i)
the MSE of action prediction, and (ii) the cosine similarity between past-to-current and
future-to-current motion embeddings. Critically, CoMo exhibits strong zero-shot
generalization, enabling it to generate effective pseudo actions for unseen videos.
The shared continuous distribution of robot actions and video latent motion also
directly benefits the joint learning of a unified policy. Extensive simulated and
real-world experiments show that policies co-trained with CoMo pseudo actions achieve
superior performance with both diffusion and autoregressive architectures.
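As a rough illustration of the two metrics named in the abstract, the sketch below computes (i) the MSE of action prediction and (ii) the mean cosine similarity between past-to-current and future-to-current motion embeddings. Function names, array shapes, and the averaging scheme are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def action_mse(pred_actions: np.ndarray, gt_actions: np.ndarray) -> float:
    """Metric (i): MSE between actions predicted from latent motion and ground truth.
    Shapes are assumed to be (batch, action_dim)."""
    return float(np.mean((pred_actions - gt_actions) ** 2))

def motion_consistency(past_to_current: np.ndarray,
                       future_to_current: np.ndarray) -> float:
    """Metric (ii): per-sample cosine similarity between past-to-current and
    future-to-current motion embeddings, averaged over the batch.
    Shapes are assumed to be (batch, embed_dim)."""
    a = past_to_current / np.linalg.norm(past_to_current, axis=-1, keepdims=True)
    b = future_to_current / np.linalg.norm(future_to_current, axis=-1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=-1)))
```

Under this reading, a well-behaved motion representation yields low action-prediction MSE and high past/future embedding consistency.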
Primary Area: applications to robotics, autonomy, planning
Submission Number: 12295