Efficient Action Robust Reinforcement Learning with Probabilistic Policy Execution Uncertainty

Published: 04 Aug 2024, Last Modified: 04 Aug 2024. Accepted by TMLR. License: CC BY-SA 4.0
Abstract: Robust reinforcement learning (RL) aims to find a policy that optimizes the worst-case performance in the face of uncertainties. In this paper, we focus on action robust RL with probabilistic policy execution uncertainty, in which, instead of always carrying out the action specified by the policy, the agent takes the action specified by the policy with probability $1-\rho$ and an alternative adversarial action with probability $\rho$. We establish the existence of an optimal policy for action robust MDPs with probabilistic policy execution uncertainty and provide the action robust Bellman optimality equation for its solution. Based on this, we develop the Action Robust Reinforcement Learning with Certificates (ARRLC) algorithm, which achieves minimax optimal regret and sample complexity. Our results highlight that action robust RL shares the same sample complexity barriers as standard RL, ensuring robust performance without additional complexity costs. Furthermore, we conduct numerical experiments to validate our approach's robustness, demonstrating that ARRLC outperforms non-robust RL algorithms and converges faster than other action robust RL algorithms in the presence of action perturbations.
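To illustrate the uncertainty model described in the abstract, the following is a minimal sketch (not the paper's ARRLC algorithm) of the probabilistic policy execution model and one step of an action robust Bellman backup on a known tabular MDP. The names `P`, `R`, `V`, and `rho` are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def execute_action(agent_action, adversary_action, rho, rng):
    """With probability 1 - rho the agent's action is carried out;
    with probability rho the adversary's action is taken instead."""
    return adversary_action if rng.random() < rho else agent_action

def action_robust_bellman_backup(P, R, V, rho, gamma=0.99):
    """One application of the action robust Bellman optimality operator:
    the agent maximizes the rho-mixture of its own action value and the
    worst-case (min over actions) adversarial action value.

    P: transition probabilities, shape (S, A, S)
    R: rewards, shape (S, A)
    V: current value estimate, shape (S,)
    """
    Q = R + gamma * P @ V             # shape (S, A): standard action values
    worst = Q.min(axis=1)             # adversary picks the worst action per state
    robust_Q = (1 - rho) * Q + rho * worst[:, None]
    return robust_Q.max(axis=1)       # agent best-responds to the mixture
```

Iterating `action_robust_bellman_backup` to a fixed point would yield the robust optimal value under this perturbation model; the paper's ARRLC algorithm instead learns from samples with certificates and matching regret bounds.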
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: The main changes from the first version are: 1) we added more comparisons between ARRLC and robust RL methods, along with a table of the complexity bounds established in previous works; 2) we modified the figures on the main page so that they contain the curves of all models, showcasing the overall comparison between different models; 3) we added more experiments: the performance of the model-free ARQ-H, and a comparison among the non-robust approach ORLC trained without an adversary, the non-robust approach trained against the worst adversary learned by ARRLC, and ARRLC against the worst adversary learned by ARRLC. The main changes from the last version are: 1) we highlighted, on the main page and in the abstract, the contribution that action robust RL has the same sample complexity barriers as standard RL; 2) we changed the symbol $\widetilde{\pi}$ to $\tilde{\pi}$, which displays better in subscripts; 3) we replotted Figure 1 in the form of Figure 2 to make it clearer and more informative; 4) we fixed typos in the paper.
Supplementary Material: zip
Assigned Action Editor: ~Mirco_Mutti1
Submission Number: 2565