Variational Adversarial Training Towards Policies with Improved Robustness
While reinforcement learning (RL) is a standard tool for policy formulation, it often struggles to deliver solutions that remain robust across varying scenarios, leading to marked performance drops under environmental perturbations. Traditional adversarial training, based on a two-player max-min game, is known to bolster the robustness of RL agents, but it faces two challenges: first, the complexity of the worst-case optimization problem may induce over-optimism, and second, the choice of a specific set of potential adversaries might lead to over-pessimism by considering implausible scenarios. In this work, we first observe that these two challenges do not cancel each other out. We therefore propose to apply variational optimization to optimize over the worst-case distribution of adversaries instead of a single worst-case adversary. Moreover, to counteract over-optimism, we train the RL agent to maximize the lower quantile of the cumulative rewards under the worst-case adversary distribution. Our algorithm demonstrates significant improvements over existing robust RL methods, corroborating the importance of the identified challenges and the effectiveness of our approach. To alleviate the computational overhead associated with the proposed approach, we also propose a simplified variant with a lower computational burden and only minimal performance degradation. Extensive experiments validate that our approaches consistently yield policies with superior robustness.
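To illustrate the lower-quantile objective described above, the following minimal sketch evaluates a policy on returns from the worst fraction of rollouts under a sampled adversary distribution. It is only an assumption-laden toy example, not the paper's algorithm: the environment (`toy_rollout`), the Gaussian adversary distribution, and all parameter names are hypothetical stand-ins.

```python
# Illustrative sketch (hypothetical, not the paper's exact method): estimate a
# lower-quantile robustness objective by sampling adversary parameters from a
# distribution and scoring the policy on the worst alpha-fraction of rollouts.
import numpy as np

rng = np.random.default_rng(0)

def toy_rollout(policy_param: float, adversary_param: float, horizon: int = 50) -> float:
    """Toy surrogate for a rollout in a perturbed environment; returns the cumulative reward."""
    ret, state = 0.0, 0.0
    for _ in range(horizon):
        action = policy_param * state + rng.normal(scale=0.1)
        # The adversary perturbs the dynamics (e.g., an unmodeled disturbance force).
        state = 0.9 * state + action + adversary_param * rng.normal(scale=0.1)
        ret += -state ** 2  # reward penalizes deviation from the origin
    return ret

def lower_quantile_objective(policy_param: float,
                             adversary_mean: float,
                             adversary_std: float,
                             alpha: float = 0.1,
                             n_samples: int = 64) -> float:
    """Average return over the worst alpha-fraction of rollouts drawn from the adversary distribution."""
    adversaries = rng.normal(adversary_mean, adversary_std, size=n_samples)
    returns = np.array([toy_rollout(policy_param, a) for a in adversaries])
    cutoff = np.quantile(returns, alpha)
    return returns[returns <= cutoff].mean()

# The agent would maximize this objective (e.g., via gradient estimation), while
# the adversary distribution itself is adapted towards worst-case behaviour.
print(lower_quantile_objective(policy_param=-0.5, adversary_mean=0.0, adversary_std=1.0))
```

In this toy setting, maximizing the lower-quantile estimate rather than the mean return focuses the policy update on the unfavorable draws from the adversary distribution, which is the pessimistic correction the abstract refers to.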