Learning Robust Representations for Visual Reinforcement Learning via Task-Relevant Mask Sampling

Published: 18 Sept 2025, Last Modified: 18 Sept 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: Humans excel at isolating relevant information from noisy data to predict the behavior of dynamic systems, effectively disregarding non-informative, temporally correlated noise. In contrast, existing visual reinforcement learning algorithms struggle to generate noise-free predictions in high-dimensional, noise-saturated environments, especially when trained on world models featuring realistic background noise extracted from natural video streams. We propose Task Relevant Mask Sampling (TRMS), a novel approach for identifying task-specific and reward-relevant masks. TRMS uses an existing segmentation model as a masking prior, followed by a mask selector that dynamically identifies a subset of masks at each timestep, selecting those most likely to contribute to task-specific rewards. To mitigate the high computational cost of this masking prior, a lightweight student network is trained in parallel. This network learns to perform masking independently and replaces the Segment Anything Model (SAM)-based teacher network after a brief initial phase (under 10-25% of total training). TRMS enhances the generalization of Soft Actor-Critic agents under distractions and achieves better performance on the RL-ViGen benchmark, which includes challenging variants of the DeepMind Control Suite, Dexterous Manipulation, and Quadruped Locomotion tasks.
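The core idea of the abstract's mask-selection step can be illustrated with a minimal sketch: rank candidate segmentation masks by a reward-relevance score and keep only the top-k before masking the observation. The function names, the toy observation, and the scores below are all hypothetical illustrations, not the paper's actual implementation; in TRMS the candidate masks would come from a segmentation prior such as SAM and the scores from a learned selector.

```python
import numpy as np

def select_task_relevant_masks(masks, relevance_scores, top_k=2):
    # Keep the top-k candidate masks ranked by (hypothetical) reward-relevance scores.
    order = np.argsort(relevance_scores)[::-1][:top_k]
    return [masks[i] for i in order]

def apply_masks(observation, masks):
    # Zero out pixels outside the union of the selected masks.
    union = np.zeros(observation.shape[:2], dtype=bool)
    for m in masks:
        union |= m
    return observation * union[..., None]

# Toy example: a 4x4 RGB observation and two candidate binary masks.
obs = np.ones((4, 4, 3))
m1 = np.zeros((4, 4), dtype=bool); m1[:2, :2] = True   # top-left region
m2 = np.zeros((4, 4), dtype=bool); m2[2:, 2:] = True   # bottom-right region

# Pretend the selector scored m1 as far more reward-relevant than m2.
selected = select_task_relevant_masks([m1, m2], relevance_scores=[0.9, 0.1], top_k=1)
masked = apply_masks(obs, selected)
```

In the paper's setting, the selector network would produce the relevance scores from the observation and reward signal, and a lightweight student would eventually replace the expensive segmentation teacher entirely.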
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Corrected the asterisk notation in the figure; refined overall formatting for improved consistency and presentation.
Supplementary Material: zip
Assigned Action Editor: ~Zhihui_Zhu1
Submission Number: 4857