Keywords: cross-embodiment transfer, visual representation learning, imitation learning, manipulation
Abstract: Effective robot learning requires diverse and realistic data, yet collecting such data is expensive and often embodiment-specific. As a result, existing datasets are fragmented across embodiments. While prior work has explored cross-embodiment transfer, it remains challenging due to the large visual gap between robots, especially under third-person viewpoints. We propose UniLatent, a cross-embodiment transfer framework based on observation alignment. UniLatent renders motion-aligned views of different robots in simulation and aligns their visual encoders so that embodiment-specific observations map into a shared latent space. Policies trained in this latent space transfer efficiently across robots without explicit pixel-level translation. Across simulation and real-world experiments, UniLatent outperforms pixel-translation baselines by over 30% on average and enables effective few-shot real-world transfer.
Submission Number: 36