Abstract: Counterfactual explanations in reinforcement learning (RL) aim to answer what-if questions by demonstrating sparse and minimal changes to states, resulting in the probability mass moving from one action to another. Although these explanations are effective in classification tasks that look for the presence of concepts, RL brings new challenges that counterfactual methods need to solve. These challenges include defining state similarity, avoiding out-of-distribution states, and improving the discriminative power of explanations. Given a state of interest, called the query state, we solve these problems by asking how long the agent can keep executing the query state's action without incurring a negative outcome in terms of expected return. We coin this an outcome-based semifactual (OSF) explanation and find the OSF state by simulating trajectories from the query state. The last state in a subtrajectory where the agent can still take the same action as in the query state without incurring a negative outcome is the OSF state. This state is discriminative, plausible, and similar to the query state. It abstracts away unimportant action switching with little explanatory value and shows the boundary between positive and negative outcomes. Qualitatively, we show that our method explains when an agent must switch actions, making the agent's behavior easier to understand. Quantitatively, we demonstrate that our method can increase policy performance while reducing how often the agent switches actions across six environments. The code and trained models are made open source.
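A minimal sketch of the search procedure described in the abstract: starting from the query state, repeatedly apply the query action in a simulator and keep the last state whose expected return stays above an outcome threshold. The names `simulate_step`, `expected_return`, `outcome_threshold`, and `max_horizon` are illustrative assumptions for this sketch, not the paper's actual interface or released code.

```python
def find_osf_state(query_state, query_action, simulate_step, expected_return,
                   outcome_threshold, max_horizon=50):
    """Sketch: roll forward from the query state, repeating the query action,
    and return the last state for which doing so still yields a non-negative
    outcome (expected return at or above `outcome_threshold`).

    Assumed callables (hypothetical, for illustration only):
      simulate_step(state, action) -> (next_state, done)
      expected_return(state, action) -> float
    """
    osf_state = query_state
    state = query_state
    for _ in range(max_horizon):
        next_state, done = simulate_step(state, query_action)
        if done or expected_return(next_state, query_action) < outcome_threshold:
            # Taking the query action again would incur a negative outcome,
            # so the previously recorded state is the OSF state.
            break
        osf_state = next_state
        state = next_state
    return osf_state
```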
Supplementary Material: pdf
Submission Number: 145