Keywords: Reinforcement Learning, Representation Learning, Sensorimotor Learning
TL;DR: Multisensory pretraining enhances Reinforcement Learning for contact-rich tasks by learning expressive representations through masked autoencoding.
Abstract: Effective contact-rich manipulation requires robots
to synergistically leverage vision, force, and proprioception.
However, Reinforcement Learning agents struggle to learn
in such multisensory settings, especially amidst sensory noise
and dynamic changes. We propose MultiSensory Dynamic
Pretraining (MSDP), a novel framework for learning expressive
multisensory representations tailored for task-oriented policy
learning. MSDP is based on masked autoencoding and trains
a transformer-based encoder by reconstructing multisensory
observations from only a subset of sensor embeddings, leading
to cross-modal prediction and sensor fusion. For downstream
policy learning, we introduce a novel asymmetric architecture,
where a cross-attention mechanism allows the critic to extract
dynamic, task-specific features from the frozen embeddings,
while the actor receives a stable pooled representation to guide
its actions. Our method demonstrates accelerated learning and
robust performance under diverse perturbations, including
sensor noise and changes in object dynamics. Evaluations in
multiple challenging, contact-rich robot manipulation tasks in
simulation and the real world showcase the effectiveness of
MSDP. It also achieves high success rates on the real robot
with as few as 6,000 online interactions, offering a simple yet
powerful solution for complex multisensory robotic control.
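
As a rough illustration of the masking step behind "reconstructing multisensory observations from only a subset of sensor embeddings", the sketch below randomly splits sensor tokens into a visible set (fed to the encoder) and a masked set (to be reconstructed). The token layout, the 0.75 mask ratio, and the function name are assumptions made for this example, not details taken from the paper.

```python
import random


def split_tokens_for_masking(num_tokens, mask_ratio, rng):
    """Randomly partition token indices into visible and masked sets.

    Hypothetical helper: in a masked autoencoder, the encoder would see
    only the visible tokens and a decoder would reconstruct the masked ones.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_masked = int(num_tokens * mask_ratio)
    order = list(range(num_tokens))
    rng.shuffle(order)  # random permutation of token indices
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Assumed token layout: 4 vision, 2 force, 2 proprioception tokens.
tokens = ["vision"] * 4 + ["force"] * 2 + ["proprio"] * 2
rng = random.Random(0)  # seeded for reproducibility
visible, masked = split_tokens_for_masking(len(tokens), mask_ratio=0.75, rng=rng)

print("visible:", [(i, tokens[i]) for i in visible])
print("masked: ", [(i, tokens[i]) for i in masked])
```

Because masking ignores modality boundaries, a whole sensor stream can end up hidden in a given sample, so the reconstruction has to rely on the other modalities. This is one plausible way the cross-modal prediction mentioned in the abstract could arise.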
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Video: mp4
Submission Number: 1