11 Summaries of Papers on Explainable Reinforcement Learning With Some Commentary

Introduction

Model interpretability was a bullet point in Concrete Problems in AI Safety (2016). Since then, interpretability has come to comprise entire research directions in technical safety agendas (2020). It is safe to say that interpretability is now a very popular area of research. In fact, the topic is sufficiently mainstream that there are books on it and corporate services promising to provide it.

Interpretability for reinforcement learning, however, has received much less attention than for supervised learning. So what’s the state of research on this topic? What does progress in interpretable RL look like, and are we making progress?

What is this post? This post summarizes 11 recent papers on explaining reinforcement learning agents (in ICLR and related conferences), then provides commentary on the research. The summaries - and not the commentary - are the main point of this post. Though people like paper summaries, this is the kind of interpretive labor that isn’t traditionally awarded space in research venues. We primarily select papers appearing between 2018 and 2020, in order to bridge the gap between foundational papers published in 2010-2017 and the more recent and diverse directions of research in the field.

How to read this post. If you want to see high-level opinions on explainable RL, just read the commentary section. If you just want our summaries, they are divided into two sections you can read: Explaining RL Agents and Evaluating the Methods Used for Explaining Agents. For a quick glance at each area, we highlight one standout paper per area. For interested readers, we also provide a list of six more recent papers after the summaries.

Recommended background. Before reading this post, we recommend familiarity with basic aspects of interpretability research, i.e. the kinds of concepts in The Mythos of Model Interpretability and Towards A Rigorous Science of Interpretable Machine Learning. We suggest looking at either of these papers if you want a primer on interpretability.

Two summaries are reproduced with permission from Rohin Shah’s Alignment Newsletter as they pre-date this post (noted when they appear).

Explaining RL Agents

Here we summarize papers that introduce methods for explaining why RL agents take the actions they do.

  • Section Highlight: Explainable Reinforcement Learning Through a Causal Lens

    • AAAI 2020

    • This paper presents a series of formal definitions of what an explanation is in the context of structural causal models of an RL agent, then proposes a procedure for generating explanations of agent behavior. The authors’ goal is to develop a procedure for explaining agents’ actions themselves, rather than giving explanations of why a state counts as evidence favoring some action. The definitions require some technical context, but roughly speaking: A structural causal model (SCM) of an agent is a graph representing causal relationships between state, action, and reward nodes, with equations specifying each relationship in the graph. They define an action influence model as a causal graph plus a set of structural equations, with structural equations for each unique variable value and action pair (meaning multiple equations per variable value). Next, they say that (1) a complete explanation is the complete causal chain from an action to any future reward it leads to, (2) a minimally complete explanation is the set of parent nodes to an action, parent nodes to resulting rewards, and the rewards (so the complete explanation minus the nodes that aren’t parents to rewards), (3) a counterfactual instantiation for a counterfactual action B is the condition under which the model would select action B and the condition resulting from this selection given the SCM, and, lastly, (4) a minimally complete contrastive explanation is an explanation which “extracts the actual causal chain for the taken action A, and the counterfactual causal chain for the B, and finds the differences.” (A toy code sketch of extracting a minimally complete explanation is given at the end of this summary.)

      They give an example minimally complete contrastive explanation for why a Starcraft-playing agent chooses to not build barracks (from a formal explanation plugged into a natural language template): “Because it is more desirable to do action Build Supply Depot to have more Supply Depots as the goal is to have more Destroyed Units and Destroyed buildings.”

      How do they generate these explanations? They learn models of the structural equations in their action influence model, conditioned on user-specified causal graphs, by fitting models to observed gameplay by an agent. With learned structural models, they give an algorithm for predicting the action an agent will take in a given state. From here, they can get explanations in the above forms. They validate the learned structural models by checking that they can predict what agents will do. Prediction accuracies range from 68.2 to 94.7 across six games, including Starcraft and OpenAI Gym environments.

      Explanations are evaluated with a human subject experiment. They test two hypotheses: that receiving explanations will improve users’ mental models of the agents, as measured by their ability to predict what the agent will do in a given state, and that explanations will improve trust, as measured by subjective reports on a Likert scale. There are four conditions: (1) explanations come from their full explanation system, (2) they come from their system with more granular “atomic” actions, (3) explanations are based only on relevant variables, from prior work, given in the form “Action A is likely to increase relevant variable P”, and (4) no explanations. They conduct experiments on Mechanical Turk with 120 users: after a training phase where participants learn what Starcraft-playing agents are doing, they enter a learning phase where they see 5 videos and after each are allowed to ask as many questions about the agent behavior as they’d like (in the form why/why-not action X). Next, they predict what the agent will do in 8 given situations. Lastly, users complete the trust battery, rating explanations based on whether they are complete, sufficient, satisfying, and understandable.

      They find that given their explanation system, users are better able to predict agent behavior than in the “no explanation” or “relevant variables explanation” conditions. The improvement over the relevant variables condition is equivalent to getting one more action prediction correct out of 16 data points. Their results for the effect on trust are not statistically significant in all cases, but across the measured dimensions of trust their system improves ratings by between 0.3 and 1.0 points on their 5 point Likert scale.
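
      To make the definitions above concrete, here is a toy sketch of extracting a minimally complete explanation from an action influence graph. The graph encoding (a dict mapping each node to its parents) and the Starcraft-flavored node names are illustrative stand-ins of ours, not the paper’s implementation.

      ```python
      def minimally_complete_explanation(parents, action_node, reward_nodes):
          """Return the nodes of a minimally complete explanation: the parents of
          the action, the parents of each reward the action leads to, and the
          rewards themselves."""
          explanation = set(parents[action_node])  # what the chosen action depends on
          for r in reward_nodes:
              explanation.update(parents[r])       # what each reward depends on
              explanation.add(r)                   # the rewards themselves
          return explanation

      # Toy action influence graph (node names are illustrative only).
      parents = {
          "build_barracks": ["supply_depots", "workers"],
          "destroyed_units": ["army_size"],
          "destroyed_buildings": ["army_size"],
      }
      print(minimally_complete_explanation(
          parents, "build_barracks", ["destroyed_units", "destroyed_buildings"]))
      ```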

  • Counterfactual States for Atari Agents via Generative Deep Learning

    • IJCAI XAI Workshop 2019

    • With RL agents trained on Atari games, the authors aim to produce counterfactual states for a given state that an agent is in, which are defined as the closest states that result in a different action under the policy. This is done by learning a generative model of states conditioned on latent state representations and the policy network’s distribution over actions. Then, a gradient-based search is performed for a representation that yields a different action under the policy, and a counterfactual state is generated from this representation. The authors argue that the policy model’s latent space is too high dimensional for generation out of this space to produce coherent images. Hence, they learn a Wasserstein autoencoder on the policy model’s latent space, and perform the search in this lower-dimensional space. Another training trick ensures that the state representations actually used for generation don’t encode any information about a preferred action, unlike those in the policy network, so that the generator will meaningfully rely on the action distribution it is given. The overall generation procedure is as follows: Given a state and an agent, they pass the state through the policy network and then through the autoencoder to get a low-dimensional representation, then perform a gradient-based search in that space for the closest representation by L2 distance that yields a user-specified counterfactual action when decoded back into the policy model’s latent space and transformed into a distribution over actions. A counterfactual state is generated conditioned on this new counterfactual distribution over actions and a representation of the original state. A rough code sketch of this search step follows at the end of this summary.

      The generations are evaluated by humans for two properties: how realistic the generated images are, and self-reported subjective understanding of the observed agent. After 30 human subjects (students and local community members) play Space Invaders for 5 minutes, they are asked to rate the realism of 30 images randomly chosen from a set including real gameplay images, counterfactual generations, and images from a heavily ablated version of their model without the autoencoder. On a scale of 1 to 6, real states received a 4.97 on average, counterfactual states a 4.0, and the ablated model’s generations a 1.93. For the subjective user understanding test, participants were first shown a replay of an agent playing the game, then shown 10 pairs of states and counterfactual states (and associated actions for each), with counterfactual states selected to have large deviations from the original state. Users were asked to rate their “understanding of the agent” on a 1-6 scale before and after seeing these states. They found that 15 users’ reported understandings improved, 8 declined, and 7 were constant (with a one-sided Wilcoxon signed-rank test for improvement: p=0.098).
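
      Below is a rough, heavily simplified sketch of the latent-space search described above, with tiny stand-in networks in place of the real policy network, Wasserstein autoencoder, and generator; the module names and sizes are our own assumptions.

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      # Tiny stand-ins so the sketch runs end to end; the real networks are far larger.
      policy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(84 * 84, 64))  # state -> policy latent
      policy_head = nn.Linear(64, 6)                                        # policy latent -> action logits
      wae_enc, wae_dec = nn.Linear(64, 8), nn.Linear(8, 64)                 # autoencoder over the latent space

      def find_counterfactual_latent(state, target_action, steps=200, lr=1e-2):
          with torch.no_grad():
              z0 = wae_enc(policy_encoder(state))     # low-dimensional representation of the state
          z = z0.clone().requires_grad_(True)
          opt = torch.optim.Adam([z], lr=lr)
          target = torch.tensor([target_action])
          for _ in range(steps):
              logits = policy_head(wae_dec(z))        # decode back into the policy latent, get action logits
              loss = F.cross_entropy(logits, target)  # push toward the counterfactual action...
              loss = loss + ((z - z0) ** 2).sum()     # ...while staying close (L2) to the original state
              opt.zero_grad(); loss.backward(); opt.step()
          return z.detach()

      state = torch.rand(1, 84, 84)                   # stand-in Atari frame
      z_cf = find_counterfactual_latent(state, target_action=3)
      # The paper's generator, conditioned on the original state representation and the
      # new action distribution, would then render the counterfactual state image.
      ```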

  • Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents

    • ICLR 2020

    • The paper proposes a method for generating states with certain properties under a policy that are intended to be helpful for analyzing the policy. In particular, they identify states with large Q-values for certain actions, like hard braking by a simulated self-driving car, a large difference between best and worst Q-values (clear-cut situations), or low Q-values across actions (hopeless situations). They note that the immediate approach to doing this, for continuous states like those in the Atari games they experiment with, is activation maximization of a Q-value (or function on Q-values) with respect to the input image, but they find that in practice this produces meaningless images outside of the natural state distribution, even when a variety of tricks are used. In response, they encode states in a low-dimensional space with a VAE and perform the activation maximization by gradient ascent in this embedding space. Interestingly, they search for the parameters of a distribution over embeddings, ($\mu, \sigma$), rather than just a single embedding; later, they find that the results of the search allow them to generate samples using the VAE decoder. The VAE objective has a reconstruction loss (to generate realistic images) and a penalty on the reconstruction resulting in a different action from the original training image. They find that it is necessary to focus the reconstruction error on regions “important” to the agent, which means they weight the L2 reconstruction loss by a measure of pixel saliency obtained by applying a gradient-based saliency method to the policy at a given state. The generator is trained with trajectories from a fixed agent. We sketch the embedding-space search in code at the end of this summary.

      They provide a great deal of qualitative analysis using their generated states. A few highlights include: In Seaquest, where the player must resurface from below water when an oxygen tank is low, they suggest that an agent does not understand that it must resurface when low on oxygen, after optimizing states for the Q-value of resurfacing. They note that while “it would be possible to identify this flawed behavior by analyzing the 10,000 frames of training data for our generator, it is significantly easier to review a handful of samples from our method.” The generator can also yield examples not seen during training. With agents trained as simulated self-driving cars in an environment built by the authors, they find evidence of the absence of a policy’s ability to avoid pedestrians: with a policy trained using “reasonable pedestrians” that never crossed while there was traffic, they observe that among states maximizing the Q-value of braking, states with pedestrians in the road are conspicuously absent. This policy shortcoming is then verified in a test environment where pedestrians cross while there is oncoming traffic, and they find that the agent will run over pedestrians.
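
      Here is a minimal sketch of the embedding-space search referenced above, omitting the saliency-weighted generator training; the tiny decoder and Q-network below are stand-ins, and all names and sizes are assumptions.

      ```python
      import torch
      import torch.nn as nn

      # Tiny stand-ins: a VAE decoder over a small embedding space and a Q-network.
      decoder = nn.Sequential(nn.Linear(16, 84 * 84), nn.Unflatten(1, (1, 84, 84)))
      q_net = nn.Sequential(nn.Flatten(), nn.Linear(84 * 84, 4))  # 4 discrete actions

      def optimize_state_distribution(target_action, steps=300, lr=1e-2, samples=8):
          """Search for (mu, sigma) of a distribution over embeddings whose decoded
          states have a high Q-value for `target_action` (e.g. hard braking)."""
          mu = torch.zeros(1, 16, requires_grad=True)
          log_sigma = torch.zeros(1, 16, requires_grad=True)
          opt = torch.optim.Adam([mu, log_sigma], lr=lr)
          for _ in range(steps):
              z = mu + log_sigma.exp() * torch.randn(samples, 16)  # reparameterized embedding samples
              q = q_net(decoder(z))[:, target_action]              # Q-value of the action of interest
              loss = -q.mean()                                     # maximize that Q-value on average
              opt.zero_grad(); loss.backward(); opt.step()
          return mu.detach(), log_sigma.exp().detach()

      mu, sigma = optimize_state_distribution(target_action=2)
      # Sampling embeddings from N(mu, sigma) and decoding them yields a variety of
      # states that the (stand-in) Q-network scores highly for the chosen action.
      ```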

  • Towards Interpretable Reinforcement Learning Using Attention Augmented Agents

    • NeurIPS 2019

    • The authors propose a policy network with a spatial attention mechanism and perform qualitative analysis of the attention weights to analyze agent behavior. The network has an interesting structure: at a given timestep, a query model, which is an LSTM, produces a query vector that is passed to an attention layer that takes a representation of the current state (produced by another model) as the keys and values. The single output vector is used to obtain an action and is passed back to the LSTM. They emphasize the “top-down” nature of the attention: the query network determines the attention weights for a given state representation. In experiments with Atari games, they find that this model obtains higher average rewards than baseline feed-forward or LSTM-based models. They provide qualitative analysis (including videos) of the spatial attention, and suggest that their model pays attention to task-relevant aspects of states. They also compare their attention-based analysis against saliency scores returned by an existing saliency method, for both their attentive model and a purely feed-forward baseline. Qualitative analysis is performed using this saliency method, which involves perturbing an image by blurring at different points and computing the effect of each perturbation on the policy function output or value function output. They suggest that similarities between heatmaps given by attention and the saliency method are evidence that interpretation based on the attention values accurately reflects “where the agent is looking in the image.” Based on this analysis, they report apparent differences in the learned behaviors of the attentive and feed-forward models. A minimal code sketch of the top-down attention step appears at the end of this summary.
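
      The following is a minimal sketch of one top-down attention step (single head, with keys reused as values for brevity); the module names and sizes are illustrative rather than the paper’s exact architecture.

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      d = 32
      vision = nn.Conv2d(1, d, kernel_size=8, stride=4)  # frame -> spatial feature map (keys/values)
      lstm = nn.LSTMCell(d, d)                           # recurrent "query network"
      to_query = nn.Linear(d, d)
      policy_head = nn.Linear(d, 6)                      # attention readout -> action logits

      def step(frame, h, c):
          keys = vision(frame).flatten(2).transpose(1, 2)                    # (1, H*W, d): one key per location
          query = to_query(h).unsqueeze(1)                                   # (1, 1, d): query is top-down, from the LSTM
          attn = F.softmax(query @ keys.transpose(1, 2) / d ** 0.5, dim=-1)  # (1, 1, H*W) attention weights
          read = (attn @ keys).squeeze(1)                                    # attention-weighted summary of the frame
          h, c = lstm(read, (h, c))                                          # feed the readout back to the query network
          return policy_head(h), attn.view(-1), (h, c)                       # the weights can be shown as a heatmap

      h = c = torch.zeros(1, d)
      logits, attn_weights, (h, c) = step(torch.rand(1, 1, 84, 84), h, c)
      ```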

  • Exploratory Not Explanatory: Counterfactual Analysis of Saliency Maps for Deep Reinforcement Learning

    • ICLR 2020

    • From Alignment Newsletter #101:

      This paper presents an analysis of the use of saliency maps in deep vision-based reinforcement learning on ATARI. They consider several types of saliency methods, all of which produce heatmaps on the input image. They show that all (46 claims across 11 papers) uses of saliency maps in deep RL literature interpret them as representing the agent’s “focus”, 87% use the saliency map to generate a claim about the agent’s behaviour or reasoning, but only 7% validate their claims with additional or more direct evidence.

      They go on to present a framework to turn subjective and under-defined claims about agent behaviour generated with saliency maps into falsifiable claims. This framework effectively makes the claim more specific and targeted at specific semantic concepts in the game’s state space. Using a fully parameterized version of the ATARI environment, they can alter the game’s state in ways which preserve meaning (i.e. the new state is still a valid game state). This allows them to perform interventions in a rigorous way, and falsify the claims made in their framework.

      Using their framework, they perform 3 experimental case studies on popular claims about agent behaviour backed up by saliency maps, and show that all of them are false (or at least stated more generally than they should be). For example, in the game Breakout, agents tend to build tunnels through the bricks to get a high score. Saliency maps show that the agent attends to these tunnels in natural games. However, shifting the position of the tunnel and/or the agent’s paddle and/or the ball all remove the saliency on the tunnel’s location. Even flipping the whole screen vertically (which still results in a valid game state) removes the saliency on the tunnel’s location. This shows that the agent doesn’t understand the concept of tunnels generally or robustly, which is often what is claimed.
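
      As a toy illustration of the intervention logic, the snippet below uses a fake saliency function and a pixel shift in place of the paper’s semantics-preserving interventions on the parameterized ATARI state; everything here is a stand-in.

      ```python
      import numpy as np

      def saliency_map(state):
          return np.abs(state)                     # toy stand-in for a real saliency method

      def tunnel_saliency(state, tunnel_mask):
          s = saliency_map(state)
          return s[tunnel_mask].sum() / s.sum()    # fraction of saliency mass on the tunnel region

      rng = np.random.default_rng(0)
      state = rng.random((84, 84))                 # toy "frame"
      tunnel_mask = np.zeros((84, 84), dtype=bool)
      tunnel_mask[10:20, 60:70] = True             # where the tunnel currently is

      before = tunnel_saliency(state, tunnel_mask)
      shifted_state = np.roll(state, shift=-30, axis=1)       # intervention: move the tunnel left
      shifted_mask = np.roll(tunnel_mask, shift=-30, axis=1)  # the tunnel's new location
      after = tunnel_saliency(shifted_state, shifted_mask)
      print(before, after)

      # This toy saliency trivially follows the shift, so before == after here. The point
      # of the paper's experiments is that for real agents and saliency methods, the
      # saliency often does NOT follow the tunnel under such interventions, which
      # falsifies the general claim that "the agent attends to tunnels."
      ```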

  • Causal Analysis of Agent Behavior for AI Safety

    • arXiv 2021

    • From Alignment Newsletter #141:

      A common challenge when understanding the world is that it is very hard to infer causal structure from only observational data. Luckily, we aren’t limited to observational data in the case of AI systems: we can intervene on either the environment the agent is acting in, or the agent itself, and see what happens. In this paper, the authors present an “agent debugger” that helps with this, which has all the features you’d normally expect in a debugger: you can set breakpoints, step forward or backward in the execution trace, and set or monitor variables.

      Let’s consider an example where an agent is trained to go to a high reward apple. However, during training the location of the apple is correlated with the floor type (grass or sand). Suppose we now get an agent that does well in the training environment. How can we tell if the agent looks for the apple and goes there, rather than looking at the floor type and going to the location where the apple was during training?

      We can’t distinguish between these possibilities with just observational data. However, with the agent debugger, we can simulate what the agent would do in the case where the floor type and apple location are different from how they were in training, which can then answer our question.

      We can go further: using the data collected from simulations using the agent debugger, we can also build a causal model that explains how the agent makes decisions. We do have to identify the features of interest (i.e. the nodes in the causal graph), but the probability tables can be computed automatically from the data from the agent debugger. The resulting causal model can then be thought of as an “explanation” for the behavior of the agent.
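
      Here is a toy sketch of that last step: estimating a conditional probability table for the agent’s action from debugger-style interventions. The grid-world variables and the stand-in agent are our own illustration, not the paper’s setup.

      ```python
      from collections import Counter, defaultdict
      from itertools import product

      def agent_policy(floor_type, apple_side):
          # Stand-in for running the real agent inside the debugger. This toy agent
          # has latched onto the spurious training correlation: it follows the floor.
          return "left" if floor_type == "grass" else "right"

      counts = defaultdict(Counter)
      for floor_type, apple_side in product(["grass", "sand"], ["left", "right"]):
          for _ in range(100):  # roll out repeatedly under each intervention
              counts[(floor_type, apple_side)][agent_policy(floor_type, apple_side)] += 1

      # Conditional probability table P(action | floor_type, apple_side)
      for parent_values, c in counts.items():
          total = sum(c.values())
          print(parent_values, {a: n / total for a, n in c.items()})

      # If the table varies with floor_type but not with apple_side, the fitted
      # causal model says the agent is deciding based on the floor, not the apple.
      ```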

Evaluating the Methods Used for Explaining Agents

Now we turn our attention to papers that evaluate methods for explaining model behaviors. We seek to provide some context from the broader interpretability literature on the explanation methods that appear in work with RL agents.

  • Section Highlight: Are Visual Explanations Useful? A Case Study in Model-in-the-Loop Prediction

    • arXiv 2020

    • The authors run an RCT to see how different model explanation approaches can help with human-in-the-loop prediction, as well as trust in the model. The prediction task is on the APPA-REAL dataset, which consists of over 7,000 face images and age labels. The experiment has two base conditions, one where users are asked to give an age prediction, and one where users are asked to give a prediction and are also shown the model’s guess. The explanation groups were shown one of three explanations in addition to the model’s output: a saliency map from the actual model (calculated with Integrated Gradients), a saliency map from a model trained on a modified dataset with spurious correlations, and a random saliency map. Before collecting data, the authors ran a two-tailed power analysis using prior guesses on the dataset. The experiment also varied the framing, with the following three modifications: (1) Delayed Prediction, which asked for a user’s guess, showed the model output, and asked for a revised user guess; (2) Empathetic, which described the model’s output in a personified way; and (3) Show Top-3 Range, which output an age interval. The experiment was conducted on Amazon Mechanical Turk with 1,058 participants. Overall, participants were more accurate at guessing people’s ages when they had access to the model’s guesses, but having explanations of the model outputs did not further improve their accuracy. The authors note that this is likely because explanations had little effect on user trust in the model’s outputs. The trust that participants had in each model differed only slightly between conditions, regardless of whether explanations were the real saliency maps or randomly generated (there is a slight trend but it is not statistically significant). In fact, participants found explanations to be “reasonable” even when they focused on the background and not on the face. The authors give quotes from participants explaining their reasoning processes. One participant, for example, noticed that explanations could appear faulty but thought the model’s guesses seemed reasonable otherwise, so they “sort of went with it.”
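
      As a reference point for the saliency-map conditions, here is a minimal Integrated Gradients sketch on a stand-in age-regression model; the study’s real model is a CNN trained on face images, and this tiny model is only for illustration.

      ```python
      import torch
      import torch.nn as nn

      model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))  # stand-in age regressor

      def integrated_gradients(model, x, baseline=None, steps=50):
          baseline = torch.zeros_like(x) if baseline is None else baseline
          total_grads = torch.zeros_like(x)
          for alpha in torch.linspace(0.0, 1.0, steps):
              point = (baseline + alpha * (x - baseline)).requires_grad_(True)  # interpolate baseline -> input
              model(point).sum().backward()                                     # gradient of the predicted age
              total_grads += point.grad
          return (x - baseline) * total_grads / steps  # average gradient times the input difference

      x = torch.rand(1, 3, 64, 64)                                   # stand-in face image
      attribution = integrated_gradients(model, x).abs().sum(dim=1)  # per-pixel saliency map
      ```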

  • Do explanations make VQA models more predictable to a human?

    • EMNLP 2018

    • The paper presents a human subject experiment for evaluating the forward simulatability of a model given various explanation methods, using a Visual Question Answering task. They consider two simulation targets: the model’s binary correctness, and its particular predicted output. They evaluate explanation methods including Grad-CAM, visualized attention weights, and an “instantaneous feedback” condition where no explanation is included, but the simulation target is revealed to the human subject after every response. They find that the explanation procedures do not yield statistically significant improvements in accuracy, while the instantaneous feedback procedure yields large improvements (roughly 30 percentage points of simulation accuracy when predicting model outputs). Human performance on predicting the VQA model’s correctness is not as high as that of an MLP trained to predict correctness from the VQA model’s softmax outputs (about 80% accuracy), but the instantaneous feedback conditions are close, with around 75% failure prediction accuracy.
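
      Below is a small sketch of that machine baseline: a meta-model predicting correctness from the VQA model’s softmax output. The data here is synthetic; in the paper the features come from the real model.

      ```python
      import numpy as np
      from sklearn.neural_network import MLPClassifier

      rng = np.random.default_rng(0)
      softmax_outputs = rng.dirichlet(np.ones(10), size=5000)        # synthetic softmax over 10 answers
      was_correct = (softmax_outputs.max(axis=1) > 0.3).astype(int)  # synthetic "model was correct" labels

      meta_model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
      meta_model.fit(softmax_outputs[:4000], was_correct[:4000])
      print("correctness prediction accuracy:",
            meta_model.score(softmax_outputs[4000:], was_correct[4000:]))
      ```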

  • Sanity Checks for Saliency Maps

    • NeurIPS 2018

    • The authors propose two methods to validate saliency maps, an interpretability technique that visually highlights the regions of the input to which the output can be attributed. The authors point out that a good saliency map should be sensitive to both the actual model and the input labels; changing either of these should lead to a different map. Eight different saliency map techniques are evaluated: the Vanilla Gradient, Gradient $\odot$ Input, Integrated Gradients, Guided BackProp, GradCAM, and SmoothGrad (plus two special cases). The authors run two experiments following their above conjecture. The first randomizes the last N layers’ weights in the model, where N = 1 corresponds to only randomizing the last layer, and when N = model size, all weights are random. The reasoning here is that a good saliency map should be a function of the model, and not of just the input (e.g. acting like a model-agnostic edge detector). Comparison between the original saliency map and the new saliency map (on the randomized model) is done through visualizing both maps, as well as quantitatively via Spearman rank correlation, the structural similarity index measure, and the Pearson correlation of the histogram of gradients. In this first experiment, the authors find that the Vanilla Gradient is sensitive while Guided BackProp and Guided GradCAM show no change despite model degradation. The second experiment randomizes the labels of the input data and trains a new model. The reasoning is that saliency maps should also be sensitive to the relationship between the inputs and the labels; outlining a bird in the image, for example, is not useful if the true label is “dog”. The model is trained to at least 95% training accuracy and then the saliency maps are applied. Again, the Vanilla Gradient shows sensitivity. Integrated Gradients and Gradient $\odot$ Input continue to highlight much of the same input structure. Both experiments were conducted on a variety of models and datasets, including Inception v3 trained on ImageNet, CNN on MNIST and Fashion MNIST, MLP trained on MNIST, and Inception v4 trained on skeletal radiograms.
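
      Here is a toy sketch of the model-randomization check, using vanilla gradient saliency and an untrained stand-in CNN; the paper runs this on trained models and compares maps with several similarity metrics.

      ```python
      import copy
      import torch
      import torch.nn as nn
      from scipy.stats import spearmanr

      # Untrained stand-in CNN; the paper uses trained ImageNet/MNIST models.
      model = nn.Sequential(nn.Conv2d(1, 8, 3), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 26 * 26, 10))

      def vanilla_gradient(model, x, target):
          x = x.clone().requires_grad_(True)
          model(x)[0, target].backward()           # gradient of the target logit w.r.t. the input
          return x.grad.abs().squeeze()

      def randomize_last_n(model, n):
          degraded = copy.deepcopy(model)
          layers = [m for m in degraded if hasattr(m, "weight")][::-1]  # layers ordered last to first
          for layer in layers[:n]:
              nn.init.normal_(layer.weight)        # destroy this layer's learned weights
          return degraded

      x, target = torch.rand(1, 1, 28, 28), 3
      original = vanilla_gradient(model, x, target)
      for n in (1, 2):
          degraded_map = vanilla_gradient(randomize_last_n(model, n), x, target)
          rho, _ = spearmanr(original.flatten().numpy(), degraded_map.flatten().numpy())
          print(f"last {n} layer(s) randomized: Spearman rank correlation = {rho:.2f}")

      # A saliency method whose maps barely change under this randomization (the
      # correlation stays high) fails the sanity check: it is not really a function
      # of what the model has learned.
      ```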

  • Counterfactual Visual Explanations

    • ICML 2019

    • In this paper, the authors propose a method of generating counterfactual explanations for image models. A counterfactual explanation in this framework is a part of the input image that, if changed, would lead to a different class prediction. The authors formalize the minimum-edit counterfactual problem, which is defined as the smallest number of replacements between an input I (which the model classifies as label A) and another input I’ (which the model classifies as label B) such that the model will predict class B for the newly edited input I. The actual edit is done by permuting I’ and then replacing a subset of I with values from the permuted I’. Because the search space is too large to solve this problem exactly, the authors present two relaxations of the problem. The first method is to iteratively look for the single edit which leads to the largest increase in the model’s log probability for class B. The second method relaxes the binary subset of I’ values (originally the Hadamard product of a binary vector with I’) to a point on the simplex, i.e. a distribution over all features in I’. Then, both the permutation and the subset coefficients are learned via gradient descent. These explanations are used on four datasets: SHAPES, MNIST, Omniglot, and Caltech-UCSD Birds (CUB). In all four cases, the explanation is generated from the last layer of the CNN used. The authors evaluate the explanations qualitatively by examining which regions from I and I’ are permuted to form the new counterfactual image. In the shown examples, the counterfactual images are constructed via appropriate portions of I’, for example a “1” from MNIST incorporating another spoke from a “4” to look more like it. The authors also evaluate the average number of edits needed to change the class label. The authors then used the counterfactual explanations from the CUB dataset to set up a training task where graduate students were tasked with learning how to classify images into one of two classes (which is not a trivial task). When participants got a choice wrong in the training phase, they were shown a counterfactual image. Their performance on the test phase was compared to two other baselines: students who were given no example (only right/wrong) during training and students who were shown a GradCAM heatmap during training. The counterfactual image group had the highest accuracy, but this was not significant at the 90% confidence level against either baseline.
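
      Below is a rough sketch of the first (greedy) relaxation, ignoring the permutation of I’ for brevity and using a tiny stand-in backbone; the paper applies this at the last convolutional layer of real classifiers.

      ```python
      import torch
      import torch.nn as nn

      # Tiny stand-in feature extractor and classifier head.
      backbone = nn.Sequential(nn.Conv2d(3, 16, 5, stride=4), nn.ReLU())  # image -> 16 x 15 x 15 features
      head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

      def greedy_counterfactual(img_a, img_b, class_b, max_edits=10):
          """Greedily copy single spatial feature cells from I' (class B) into I's
          feature map until the prediction flips to class B."""
          with torch.no_grad():
              feats_a, feats_b = backbone(img_a), backbone(img_b)
              edited, edits = feats_a.clone(), []
              h, w = edited.shape[-2:]
              while len(edits) < max_edits:
                  best_score, best_cell = None, None
                  for i in range(h):
                      for j in range(w):
                          trial = edited.clone()
                          trial[0, :, i, j] = feats_b[0, :, i, j]          # try swapping in one cell from I'
                          score = head(trial).log_softmax(-1)[0, class_b]  # log P(class B) after the edit
                          if best_score is None or score > best_score:
                              best_score, best_cell = score, (i, j)
                  i, j = best_cell
                  edited[0, :, i, j] = feats_b[0, :, i, j]                 # commit the best single edit
                  edits.append(best_cell)
                  if head(edited).argmax(-1).item() == class_b:            # stop once the prediction flips
                      break
              return edits

      edits = greedy_counterfactual(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64), class_b=7)
      print("edited cells:", edits)
      ```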

  • Explanation by Progressive Exaggeration

    • ICLR 2020

    • The authors “propose a method that explains the outcome of a classification black-box by gradually exaggerating the semantic effect of a given class.” A resulting explanation is a series of altered images shifting from one class to another. Their method uses GANs as the underlying model for the generation of images; at each step, they make a change such that the model’s probability of the desired class increases from the previous step. The authors run six experiments using two types of data: human faces and X-rays. Their evaluations include qualitative analysis of explanations (e.g. identifying model biases and conflations of features), checking that statistics of altered images match those of real images receiving the same model output, and the effect on accuracy of corrupting pixels identified by their method as “important” to a class. They also run human studies where they see if MTurkers can identify the target attribute being explained based on the explanations; participant accuracy ranged from around 77% to 93% depending on the difficulty of the task.

More Papers on Explainable RL

Below we give a few papers appearing since 2020 without summaries, for readers interested in subsequent work.

Commentary

Few papers relative to problem importance. We have come across surprisingly few papers in this area relative to its importance. There appear to be important questions unique to explaining agents (rather than classifiers). For instance, explaining agents’ behaviors will require special consideration of actions’ temporal dependence, agent “plans”, and epistemic vs. instrumental rationality. And the whole exercise will be complicated by multi-agent scenarios. This area really merits a lot more work, and for people interested in AI safety, this is likely one of the more important subareas of interpretability research.

Limited scope of explanation methods for RL agents. To date, most of the work in explainable RL has applied approaches from the Feature Importance Estimation and Counterfactual Generation literature, though there is also an interesting line of work focusing on causal models of agent behavior. Some interesting early results have emerged regarding the kinds of explanations that help users build better mental models of agents, but the area is so new that it remains to be seen what the most promising approaches are, and the field may benefit from a wider variety of explanation methods. There are many other general approaches to explanation, but a few potentially useful ones include: (1) interpreting model weights and representations; (2) explanation by examples, prototypes, and exemplars; (3) finding influential training data; (4) giving natural language explanations; and (5) extensible test case design.

Unclear which evaluation procedures are best for explaining agents. There are now many ways to evaluate explanations, including procedures for evaluating explanations of arbitrary format. The approaches include both automatic procedures and human study designs, and the bulk of the work has focused on feature importance estimates. We are excited by many of the approaches, particularly those assessing whether explanations improve human-AI team performance at a task that is hard for either humans or AI alone.

One trouble here is that there are so many evaluation procedures that it can be hard for methods papers to choose which to use. The choice of evaluation procedure in any given methods paper can seem almost arbitrary (though there is a noticeable preference for automatic methods over human studies). We imagine this trend arises partly from the following situation: (1) there is not a common understanding of which explanation procedures answer which research questions; and (2) methodologies are introduced without sufficiently precise research questions in mind. (Alternatively, some papers may truly need their own new evaluation schemes because they are answering new questions.)

Here’s an example of the above situation. There is a lot of confusion over what the actual object of our explanation is when we estimate feature importance (i.e. make saliency maps). Several research questions present themselves: should feature importance estimates explain the role of features in (1) the behavior of a particular trained model with fixed weights, or (2) the behavior of trained models obtained by a stochastic training procedure involving a model family and dataset/environment, or (3) the solvability of a task, either in theory or with respect to a given training procedure, model family, and dataset/environment? Each research question stems from a fundamentally different goal, but papers rarely distinguish between them. Do we want to learn about a given model, a family of models, or the true nature of a task? There is not yet a clear and commonly accepted set of evaluation procedures suited for each of these questions which papers on feature importance estimation can readily use. The result is that the literature is not nearly as cumulative as it could be. For any given research question, it is hard to find one-to-one comparisons between more than a couple of papers which would help you tell which methods are well suited to answering the question.

Lastly, qualitative analysis remains very popular throughout these papers. It would likely be a marginal improvement to the field if some standards for qualitative analysis were more widely adopted, or if someone wrote something good about what those standards should be. We do not mind “expert evaluation (by the author)” of the kind where the authors carry out some systematic qualitative coding regarding their method’s performance, but this quickly looks less like standard qualitative analysis and more like a measurable outcome.