Visuo-Tactile World Models

ICLR 2026 Conference Submission21189 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: world models, robotics, tactile sensing

TL;DR: Visuo-Tactile World Models (VT-WM) combine vision and touch to capture contact dynamics, yielding more faithful imagination and up to 35% higher zero-shot planning success in real robot manipulation.

Abstract: We introduce multi-task Visuo-Tactile World Models (VT-WM), which capture the physics of contact through touch reasoning. By complementing vision with tactile sensing, VT-WM better understands robot–object interactions in contact-rich tasks, avoiding common failure modes of vision-only models under occlusion or ambiguous contact states, such as objects disappearing, teleporting, or moving in ways that violate basic physics. Trained across a set of contact-rich manipulation tasks, VT-WM improves physical fidelity in imagination, achieving 33% better performance at maintaining object permanence and 29% better compliance with the laws of motion in autoregressive rollouts. Moreover, experiments show that grounding in contact dynamics also translates to planning. In zero-shot real-robot experiments, VT-WM achieves up to 35% higher success rates, with the largest gains in multi-step, contact-rich tasks. Finally, VT-WM shows data efficiency when targeting a new task, outperforming a behavioral cloning policy by over 3.5× in success rate with limited demonstrations.

Supplementary Material: pdf

Primary Area: applications to robotics, autonomy, planning

Submission Number: 21189
