Abstract: This paper proposes an algorithm that aims to improve generalization for reinforcement learning agents by reducing overfitting to confounding features. Our approach is based on a max-min game-theoretic objective. A generator transfers the style of observations during reinforcement learning; its additional goal is to perturb each observation so as to maximize the agent's probability of taking a different action. In contrast, a policy network updates its parameters to minimize the effect of such perturbations, thus staying robust while maximizing the expected future reward. Based on this setup, we propose a practical deep reinforcement learning algorithm, Adversarial Robust Policy Optimization (ARPO), to find a robust policy that generalizes to unseen environments. We evaluate our approach on the Procgen and Distracting Control Suite benchmarks for generalization and sample efficiency. Empirically, ARPO shows improved performance over several baseline algorithms, including data augmentation.
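The max-min idea described above can be illustrated with a minimal sketch (not the authors' implementation): a generator is updated to maximize the divergence between the policy's action distributions on clean and perturbed observations, while the policy is updated to minimize that divergence, which in the full algorithm would be combined with the usual RL objective. The names `Policy`, `Generator`, and `adversarial_kl` below are hypothetical stand-ins, assuming a discrete-action PyTorch setup.

```python
# Hypothetical sketch of the max-min (adversarial) objective; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    """Small MLP policy returning action logits."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
    def forward(self, obs):
        return self.net(obs)

class Generator(nn.Module):
    """Produces a bounded additive perturbation of the observation
    (a stand-in for the style-transfer generator described in the abstract)."""
    def __init__(self, obs_dim, eps=0.1):
        super().__init__()
        self.eps = eps
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, obs_dim), nn.Tanh())
    def forward(self, obs):
        return obs + self.eps * self.net(obs)

def adversarial_kl(policy, generator, obs):
    """KL divergence between the policy's action distributions on clean vs. perturbed obs."""
    logp_clean = F.log_softmax(policy(obs), dim=-1)
    logp_pert = F.log_softmax(policy(generator(obs)), dim=-1)
    return F.kl_div(logp_pert, logp_clean, log_target=True, reduction="batchmean")

obs_dim, n_actions = 8, 4
policy, generator = Policy(obs_dim, n_actions), Generator(obs_dim)
opt_pi = torch.optim.Adam(policy.parameters(), lr=3e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=3e-4)

obs = torch.randn(32, obs_dim)  # placeholder batch of observations

# Generator (max) step: increase the divergence, i.e. push the agent
# toward taking a different action on the perturbed observation.
opt_g.zero_grad()
(-adversarial_kl(policy, generator, obs)).backward()
opt_g.step()

# Policy (min) step: reduce the divergence, i.e. stay robust to the perturbation.
# In the full algorithm this term would be added to the standard RL loss.
opt_pi.zero_grad()
adversarial_kl(policy, generator, obs).backward()
opt_pi.step()
```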
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Here is the summary of changes based on reviewers' suggestions and comments:
- Added comparison with DRAC. In the Procgen Maze environment, ARPO shows better sample efficiency and reduced variance in test performance.
- Incorporated rliable metrics (probability of improvement) on DCS environments. ARPO shows better performance on Walker walk variants, with a probability of improvement of 50-100%.
- Added results on SAC-based ARPO and included comparison on DCS Walker walk environment. On Walker walk background distraction, SAC-based ARPO shows improvement over the base SAC algorithm.
- Added ablation on varying training diversity. Results show that greater training diversity tends to improve generalization in the Procgen Fruitbot environment.
- Added comparison with color jitter data augmentation in DCS Walker walk environment.
Assigned Action Editor: ~Jinwoo_Shin1
Submission Number: 101