Abstract: Recent work in reinforcement learning has focused on several characteristics of learned policies that go beyond maximizing reward. These properties include fairness, explainability, generalization, and robustness. In this paper, we define offline robustness (OR), a measure of how much variability is introduced into learned policies by incidental aspects of the training procedure, such as the order of training data or the particular exploratory actions taken by agents. A training procedure has high OR when the agents it produces take very similar actions on a set of offline test data, despite variation in these incidental aspects of the training procedure. We develop an intuitive, quantitative measure of OR and calculate it for eight algorithms in three Atari environments across dozens of interventions and states. From these experiments, we find that OR varies with the amount of training and type of algorithm, and that, contrary to what one might expect, high performance does not imply high OR.
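The abstract describes OR informally as agreement between the actions of independently trained agents on a fixed set of offline test states. The paper's exact measure is not reproduced here; the sketch below is only an illustrative pairwise action-agreement statistic under that reading, with the `action_table` layout an assumption for the example.

```python
import numpy as np

def action_agreement(action_table: np.ndarray) -> float:
    """Illustrative agreement score: the fraction of test states on which
    a pair of agents chooses the same action, averaged over all agent pairs.
    action_table[i, s] holds the action agent i takes in offline state s."""
    n_agents, _ = action_table.shape
    total, pairs = 0.0, 0
    for i in range(n_agents):
        for j in range(i + 1, n_agents):
            total += np.mean(action_table[i] == action_table[j])
            pairs += 1
    return total / pairs

# Hypothetical example: 3 agents evaluated on 4 offline test states.
acts = np.array([[0, 1, 2, 1],
                 [0, 1, 2, 0],
                 [0, 2, 2, 1]])
score = action_agreement(acts)  # averages agreement over the 3 agent pairs
```

A score near 1 would indicate that incidental training variation (data order, exploration) barely changes the resulting policy's behavior on the test states.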
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In this version we do the following:
* rename interventional robustness to offline robustness
* expand the discussion of performance and robustness
* comment that the measure can be used with macro-actions (e.g. options)
* discuss policy churn in the related literature
* clarify claims throughout the paper and clean up language
Assigned Action Editor: ~Matthieu_Geist1
Submission Number: 525