Task-Relevant Language-conditioned Segmentation for Robust Generalization in Reinforcement Learning

04 May 2026 (modified: 06 May 2026) · Under review for TMLR · CC BY 4.0
Abstract: Humans possess a remarkable ability to filter out irrelevant sensory clutter, extracting only the information needed to anticipate and act within dynamic environments. Visual reinforcement learning agents, by contrast, are easily distracted by task-irrelevant details; prior attempts to mitigate this through augmentation and masking strategies have improved robustness, but remain limited by computational overhead, weak semantic grounding, or instability in actor-critic training. Inspired by how language guides human perception, we introduce Task-Relevant Language-conditioned Segmentation (TaLaS), a framework that leverages language-conditioned segmentation to impose semantic structure on visual observations. TaLaS employs a two-phase design: in the first phase, a lightweight masker is pretrained on unaugmented, language-guided masks; in the second, a student masker is regularized with strong augmentations to enforce consistency. This yields a task-relevant feature extractor that improves policy stability and removes the need for online segmentation at inference time. To address the actor's deployment distribution shift, we employ asymmetric actor-critic training. TaLaS improves robustness to distractors and achieves particularly strong performance under challenging visual shifts on the RL-ViGen benchmark, which includes challenging variants of the DeepMind Control Suite, quadruped locomotion, and dexterous manipulation tasks, while remaining competitive in easier settings. Project page: https://talas-rl.github.io/.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Li_Dong1
Submission Number: 8759