The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning

23 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Representation Learning, Visual-Tactile Robotic Manipulation, Reinforcement Learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on
Abstract: Humans rely on the synergy of their senses for most essential tasks. For tasks requiring object manipulation, we seamlessly and effectively exploit the complementarity of our senses of vision and touch. This paper draws inspiration from such capabilities and aims to find a systematic approach to fuse visual and tactile information in a reinforcement learning setting. We propose Masked Multimodal Learning (M3L), which jointly learns a policy and visual-tactile representations based on masked autoencoding. The representations jointly learned from vision and touch improve sample efficiency and unlock generalization capabilities beyond those achievable through each of the senses separately. Remarkably, representations learned in a multimodal setting also benefit vision-only policies at test time. We evaluate M3L in three simulated environments that provide both visual and tactile observations: a robotic insertion environment, a door opening task, and dexterous in-hand manipulation. The results demonstrate the benefits of learning a multimodal policy. Videos of the experiments are available online, and code will be released upon acceptance.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6575
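
The abstract describes learning visual-tactile representations through masked autoencoding. As a rough illustration of that general idea, and not the authors' implementation, the sketch below masks a joint sequence of vision and touch tokens and computes a reconstruction loss on the masked positions only. The token shapes, the 0.75 mask ratio, and the function names `mask_multimodal_tokens` and `masked_reconstruction_loss` are all assumptions made for this example; the abstract does not specify them.

```python
import numpy as np


def mask_multimodal_tokens(vision_tokens, touch_tokens, mask_ratio=0.75, rng=None):
    """Randomly mask a joint sequence of vision and touch tokens.

    Hypothetical helper: token layout and mask_ratio are illustrative
    assumptions, not values taken from the paper.
    Returns (visible_tokens, keep_idx, mask), where mask is True for
    masked positions in the concatenated sequence.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Concatenate both modalities so masking is applied across them jointly.
    tokens = np.concatenate([vision_tokens, touch_tokens], axis=0)  # (N, D)
    n_total = tokens.shape[0]
    n_keep = max(1, int(round(n_total * (1.0 - mask_ratio))))
    # Keep a random subset of positions; everything else is masked.
    keep_idx = np.sort(rng.permutation(n_total)[:n_keep])
    mask = np.ones(n_total, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask


def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error averaged over masked tokens only."""
    per_token = ((pred - target) ** 2).mean(axis=1)
    return float(per_token[mask].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vision = rng.normal(size=(16, 4))  # e.g. 16 image-patch tokens (assumed)
    touch = rng.normal(size=(8, 4))    # e.g. 8 tactile tokens (assumed)
    visible, keep_idx, mask = mask_multimodal_tokens(vision, touch, 0.75, rng)
    print(visible.shape, int(mask.sum()))
```

In a full system, an encoder would process only the visible tokens and a decoder would predict the masked ones, with the policy trained on the encoder's output; those components are omitted here because the abstract gives no details about them.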