Revisiting the Static Model in Robust Reinforcement Learning

Adil Zouitine; Matthieu Geist; Emmanuel Rachelson

Revisiting the Static Model in Robust Reinforcement Learning

Adil Zouitine, Matthieu Geist, Emmanuel Rachelson

22 Sept 2023 (modified: 11 Feb 2024)Submitted to ICLR 2024EveryoneRevisionsBibTeX

Primary Area: reinforcement learning

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: Reinforcement Learning, Robust MDPs

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: We incrementally identify static adversarial transition functions in an algorithm which finds solutions to the robust MDP problem.

Abstract: Designing control policies whose performance level is guaranteed to remain above a given threshold in a span of environments is a critical feature for the adoption of reinforcement learning (RL) in real-world applications. The search for such robust policies is a notoriously difficult problem, often cast as a two-player game, whose formalization dates back to the 1970's. This two-player game is strongly related to the so-called dynamic model of transition function uncertainty, where the environment dynamics are allowed to change at each time step. But in practical applications, one is rather interested in robustness to a span of static transition models throughout interaction episodes. The static model is known to be harder to solve than the dynamic one, and seminal algorithms, such as robust value iteration, as well as most recent works on deep robust RL, build upon the dynamic model. In this work, we propose to revisit the static model. We suggest an analysis of why solving the static model under some mild hypotheses is a reasonable endeavor, and formalize the general intuition that robust MDPs can be solved by tackling a series of static problems. We introduce a generic meta-algorithm called IWOCS, which incrementally identifies worst-case transition models so as to guide the search for a robust policy. Discussion on IWOCS sheds light on new ways to decouple policy optimization and adversarial transition functions and opens new perspectives for analysis. We derive a deep RL version of IWOCS and demonstrate it is competitive with state-of-the-art algorithms on classical benchmarks.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

Supplementary Material: zip

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 5649

Loading