Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning

Published: 17 Jul 2025, Last Modified: 06 Sept 2025
Venue: EWRL 2025 Poster
License: CC BY 4.0
Keywords: Sensor Fusion, Reinforcement Learning, Multisensory Learning
TL;DR: Multisensory Pretraining leads to a rich representation for Contact-Rich Robot Reinforcement Learning
Abstract: An important element of learning contact-rich manipulation for robots is leveraging the synergy between heterogeneous sensor modalities such as vision, force, and proprioception while adapting to sensory perturbations and dynamic changes. In such multisensory settings, Reinforcement Learning (RL) faces challenges arising from varying sensory feature distributions and their changing importance depending on the task phase. To this end, taking a leaf out of multimodal representation learning, pretraining lends itself as a natural way to learn robust cross-modal feature representations useful for different downstream tasks. In this work, we propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning multisensory representations tailored for task-oriented policy learning via masked autoencoding coupled with self-supervised forward dynamics objectives to shape features from multiple different sensors. Using our pretraining approach, we demonstrate how a simple cross-attention between a learnable task-specific embedding and frozen multisensory embeddings yields consistently strong performance on downstream tasks. Our approach simplifies the handling of sensory interactions, allowing the agent to focus entirely on mastering the task. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise and changes in object dynamics. Evaluations across multiple challenging contact-rich robot manipulation tasks showcase the effectiveness and robustness of MSDP. Our framework’s modular pretraining process supports various sensor combinations, providing a simple and robust solution for complex manipulation tasks.
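The abstract's policy-side readout, a learnable task embedding cross-attending over frozen multisensory embeddings, can be sketched as follows. This is a minimal single-head NumPy illustration, not the authors' implementation: all names, the embedding dimension, and the three-sensor setup are assumptions, and a real MSDP-style model would use learned query/key/value projections and train the task embedding jointly with the policy.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over attention scores.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_readout(task_query, sensor_embeddings):
    """Single-head cross-attention: one learnable task query attends over
    frozen per-sensor embeddings and returns a single fused feature vector.
    (Hypothetical sketch; projections W_q/W_k/W_v are omitted for brevity.)"""
    d = task_query.shape[-1]
    scores = task_query @ sensor_embeddings.T / np.sqrt(d)  # (1, n_sensors)
    weights = softmax(scores, axis=-1)                      # attention over sensors
    return weights @ sensor_embeddings                      # (1, d) fused feature

# Illustrative shapes: 3 frozen sensor embeddings (e.g. vision, force,
# proprioception) of dimension 8; the task query would be a trained parameter.
rng = np.random.default_rng(0)
d = 8
task_query = rng.standard_normal((1, d))
sensor_emb = rng.standard_normal((3, d))
fused = cross_attention_readout(task_query, sensor_emb)
```

Because the sensor encoders stay frozen after pretraining, only the task query (and the downstream policy head consuming `fused`) needs gradient updates, which is one way to read the abstract's claim that the agent can "focus entirely on mastering the task."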
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Rickmer_Krohn1
Track: Regular Track: unpublished work
Submission Number: 167