Keywords: Manipulation, Visual learning, Differentiable optimization
Abstract: Specifying tasks with videos is a powerful technique towards acquiring novel and general robot skills. However, reasoning over mechanics and dexterous interactions can make it challenging to scale visual learning for contact-rich manipulation. In this work, we focus on the problem of visual dexterous planar manipulation: given a video of an object in planar motion, find contact-aware robot actions that reproduce the same object motion. We propose a novel learning architecture that combines video decoding neural models with priors from contact mechanics by leveraging differentiable optimization and differentiable simulation. Through extensive simulated experiments, we investigate the interplay between traditional model-based techniques and modern deep learning approaches. We find that our modular and fully differentiable architecture outperforms learning-only methods on unseen objects and motions.
Supplementary Material: zip
Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/arxiv:2111.05318/code)