Adversarial Style Transfer for Robust Policy Optimization in Deep Reinforcement Learning

TMLR Paper101 Authors

17 May 2022 (modified: 17 Sept 2024) | Rejected by TMLR | CC BY 4.0

Abstract: This paper proposes an algorithm that aims to improve generalization for reinforcement learning agents by removing overfitting to confounding features. Our approach is built on a max-min game-theoretic objective. A generator transfers the style of observations during reinforcement learning. The generator also perturbs each observation so as to maximize the agent's probability of taking a different action. In contrast, a policy network updates its parameters to minimize the effect of such perturbations, thus staying robust while maximizing the expected future reward. Based on this setup, we propose a practical deep reinforcement learning algorithm, Adversarial Robust Policy Optimization (ARPO), to find a robust policy that generalizes to unseen environments. We evaluate our approach on Procgen and Distracting Control Suite for generalization and sample efficiency. Empirically, ARPO shows improved performance compared to a few baseline algorithms, including data augmentation.
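
The max-min setup described in the abstract can be illustrated with a minimal sketch. The abstract does not state which divergence measures the change in action probabilities or how the terms are weighted, so the KL divergence, the weight `beta`, and the names `action_kl`, `policy_loss`, and `generator_loss` below are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def action_kl(logits_clean, logits_perturbed, eps=1e-8):
    """Mean KL(pi(.|x) || pi(.|x')) over a batch of action logits.

    Assumption: KL is used here as one possible measure of how much the
    perturbed observation x' changes the policy's action distribution.
    """
    p = softmax(np.asarray(logits_clean, dtype=float))
    q = softmax(np.asarray(logits_perturbed, dtype=float))
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(kl))


def policy_loss(rl_loss, kl, beta=1.0):
    """Policy side (hypothetical form): minimize the RL loss plus the divergence."""
    return rl_loss + beta * kl


def generator_loss(style_loss, kl, beta=1.0):
    """Generator side (hypothetical form): keep a style-transfer loss low
    while maximizing the divergence, hence the minus sign."""
    return style_loss - beta * kl
```

In this sketch the two players pull the same divergence term in opposite directions: the generator is rewarded when the perturbed observation shifts the action distribution, and the policy is penalized for that same shift.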

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: Here is the summary of changes based on reviewers' suggestions and comments:
- Added comparison with DRAC. In the Procgen Maze environment, ARPO shows better sample efficiency and reduced variance in test performance.
- Incorporated rliable metrics (probability of improvement) on DCS environments. ARPO shows better performance on Walker walk variants, with a 50-100% probability of improvement.
- Added results on SAC-based ARPO and included a comparison on the DCS Walker walk environment. On Walker walk with background distraction, SAC-based ARPO improves over the base SAC algorithm.
- Added an ablation on varying training diversity. Results show that diversity in training tends to support improved generalization in the Procgen Fruitbot environment.
- Added comparison with color jitter data augmentation in the DCS Walker walk environment.

Assigned Action Editor: ~Jinwoo_Shin1

Submission Number: 101