Keywords: Robust reinforcement learning
TL;DR: We propose IWOCS, a method for robust MDPs that finds worst-case transitions, separates policy optimization from adversarial dynamics, and matches state-of-the-art deep RL performance.
Abstract: Designing control policies whose performance level is guaranteed to remain above a given
threshold in a span of environments is a critical feature for the adoption of reinforcement learning
(RL) in real-world applications. The search for such robust policies is a notoriously difficult
problem, related to the so-called dynamic model of transition function uncertainty, where the
environment dynamics are allowed to change at each time step. In practical cases, however, one
is rather interested in robustness to a span of static transition models that remain fixed throughout
interaction episodes. The static model is known to be harder to solve than the dynamic one, and seminal
algorithms, such as robust value iteration, as well as most recent works on deep robust RL, build
upon the dynamic model. In this work, we propose to revisit the static model. We analyze why
solving the static model is, under some mild hypotheses, a reasonable endeavor, based on an
equivalence with the dynamic model, and we formalize the general intuition that
robust MDPs can be solved by tackling a series of static problems. We introduce a generic
meta-algorithm called IWOCS, which incrementally identifies worst-case transition models so
as to guide the search for a robust policy. Discussion of IWOCS sheds light on new ways of
decoupling policy optimization from adversarial transition functions and opens new perspectives
for analysis. We derive a deep RL version of IWOCS and demonstrate that it is competitive with
state-of-the-art algorithms on classical benchmarks.
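To make the abstract's description concrete, the following is a minimal, illustrative sketch of an IWOCS-style meta-loop: alternate between solving a static MDP under the current worst-case transition model and searching for a new worst-case model for the resulting policy. It reflects only the high-level idea stated above; the helper names (`solve_static_mdp`, `evaluate_policy`) and the finite candidate set are hypothetical placeholders, not the paper's actual algorithm or API.

```python
# Hypothetical sketch of an IWOCS-style loop over a finite uncertainty set of
# transition models. Not the paper's implementation.

def iwocs(candidate_models, solve_static_mdp, evaluate_policy, n_iterations=10):
    """Incrementally identify worst-case transition models to guide policy search.

    candidate_models : iterable of transition models in the uncertainty set
    solve_static_mdp : model -> policy          (any RL / planning subroutine)
    evaluate_policy  : (policy, model) -> float (estimated return of the policy)
    """
    worst_models = [next(iter(candidate_models))]  # start from an arbitrary model
    policy = None
    for _ in range(n_iterations):
        # (i) Policy optimization on the current worst-case (static) model.
        policy = solve_static_mdp(worst_models[-1])

        # (ii) Adversarial step, decoupled from policy optimization: find the
        # transition model that minimizes the current policy's return.
        returns = {m: evaluate_policy(policy, m) for m in candidate_models}
        new_worst = min(returns, key=returns.get)

        if new_worst in worst_models:
            break  # worst case unchanged: candidate robust policy found
        worst_models.append(new_worst)
    return policy, worst_models[-1]
```

In this reading, each inner call to `solve_static_mdp` is an ordinary (non-robust) RL problem, which is what allows the static robust problem to be tackled as a series of standard ones.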
Primary Area: reinforcement learning
Submission Number: 16724