Zero-Shot Visual Generalization in Model-Based Reinforcement Learning via Latent Consistency

ICLR 2026 Conference Submission 23661 Authors

20 Sept 2025 (modified: 27 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Model-based Reinforcement Learning, Visual Reinforcement Learning, Representation Learning
TL;DR: Mixed weak-to-strong augmentation and latent consistency regularization make world models generalize zero-shot to unseen visual distractions
Abstract: Model-based reinforcement learning (MBRL) has shown remarkable success in pixel-based control by planning within learned latent dynamics. However, its robustness degrades significantly when test-time observations deviate from the training distribution due to unseen distractions such as shadows, viewpoint changes, or background variations. In this paper, we propose **Vi**sual **G**eneralization in **MO**del-based RL (**ViGMO**), a novel framework that achieves zero-shot generalization to unseen visual distractions while preserving high sample efficiency. ViGMO integrates three key components: (i) a *mixed weak-to-strong augmentation* strategy to balance efficient learning with robustness, (ii) *latent-consistency learning* to enforce stable transition predictions under distribution shifts, and (iii) *encoder regularization* to preserve task-relevant features and prevent representational collapse. Extensive evaluations on the DeepMind Control Suite and Robosuite with challenging unseen distractions demonstrate that ViGMO outperforms state-of-the-art model-free and model-based baselines, improving zero-shot generalization by up to $13\%$ over the strongest baseline while maintaining the hallmark efficiency of latent-space MBRL.
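
Below is a minimal, illustrative sketch (not the authors' released code) of how the abstract's three components could fit together in a latent-space MBRL loop. The encoder `enc`, dynamics model `dyn`, the specific weak/strong augmentations, and the mixing probability `p_strong` are all assumptions introduced here for illustration only.

```python
# Hypothetical sketch of ViGMO-style auxiliary losses; names and choices are assumptions.
import torch
import torch.nn.functional as F

def weak_aug(obs, pad=4):
    """Weak augmentation: random shift (pad-and-crop), a common choice in pixel-based RL."""
    n, c, h, w = obs.shape
    padded = F.pad(obs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(obs)
    for i in range(n):
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out

def strong_aug(obs):
    """Strong augmentation: weak shift plus random cutout as a stand-in for heavy distractions."""
    obs = weak_aug(obs)
    n, c, h, w = obs.shape
    for i in range(n):
        ch, cw = h // 4, w // 4
        top = torch.randint(0, h - ch, (1,)).item()
        left = torch.randint(0, w - cw, (1,)).item()
        obs[i, :, top:top + ch, left:left + cw] = 0.0
    return obs

def vigmo_aux_losses(enc, dyn, obs, action, next_obs, p_strong=0.5):
    """(i) mixed weak-to-strong augmentation, (ii) latent consistency under distribution
    shift, (iii) encoder regularization toward weak-view targets -- all as a rough sketch."""
    # (i) per-sample mix of weak and strong views
    use_strong = torch.rand(obs.shape[0], device=obs.device) < p_strong
    aug_obs = torch.where(use_strong.view(-1, 1, 1, 1), strong_aug(obs), weak_aug(obs))

    z_aug = enc(aug_obs)                  # online features from the mixed-augmentation view
    with torch.no_grad():
        z_weak = enc(weak_aug(obs))       # target features from a weak view
        z_next = enc(weak_aug(next_obs))  # target for the next-step latent

    # (ii) latent consistency: predicted next latent should agree regardless of augmentation
    consistency = F.mse_loss(dyn(z_aug, action), z_next)

    # (iii) encoder regularization: keep strong-view features close to weak-view features
    enc_reg = F.mse_loss(z_aug, z_weak)

    return consistency, enc_reg
```

In practice these two terms would be weighted and added to the usual world-model objective (reconstruction/reward/value losses); the weights, stop-gradient placement, and exact augmentation families shown here are placeholders, not the paper's specification.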
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 23661