Explainability Via Causal Self-Talk

Published: 31 Oct 2022, Last Modified: 07 Oct 2022. NeurIPS 2022 Accept.
Keywords: explainability, reinforcement learning, deep learning, causality, interpretability
TL;DR: For explainability and control, we train agents to build a causal model of themselves.
Abstract: Explaining the behavior of AI systems is an important problem that, in practice, is generally avoided. While the XAI community has developed an abundance of techniques, most incur costs that the wider deep learning community has been unwilling to pay in most situations. We take a pragmatic view of the issue, and define a set of desiderata that capture both the ambitions of XAI and the practical constraints of deep learning. We describe an effective way to satisfy all of these desiderata: train the AI system to build a causal model of itself. We develop an instance of this solution for Deep RL agents: Causal Self-Talk (CST). CST operates by training the agent to communicate with itself across time. We implement this method in a simulated 3D environment, and show how it enables agents to generate faithful and semantically-meaningful explanations of their own behavior. Beyond explanations, we also demonstrate that these learned models provide new ways of building semantic control interfaces to AI systems.
Supplementary Material: pdf
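
The abstract only describes CST at a high level (an agent trained to "communicate with itself across time" so that the messages form a causal, semantically-grounded model of its own behavior). The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, not the paper's implementation: all module names, dimensions, targets, and losses are assumptions. It shows a recurrent agent whose message head feeds back into its own next step, with an auxiliary decoder that grounds the message in a semantic variable (here, a placeholder "goal" label).

```python
# Hypothetical sketch of a Causal Self-Talk-style agent (illustrative only).
# The agent emits a message to its future self each step; an auxiliary loss
# ties that message to a semantic variable so it doubles as an explanation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfTalkAgent(nn.Module):
    def __init__(self, obs_dim=16, msg_dim=8, hidden_dim=32,
                 num_actions=4, num_goals=5):
        super().__init__()
        self.core = nn.GRUCell(obs_dim + msg_dim, hidden_dim)  # recurrent core
        self.policy_head = nn.Linear(hidden_dim, num_actions)  # action logits
        self.message_head = nn.Linear(hidden_dim, msg_dim)     # message to future self
        self.decoder = nn.Linear(msg_dim, num_goals)           # grounds message in semantics

    def step(self, obs, prev_msg, hidden):
        hidden = self.core(torch.cat([obs, prev_msg], dim=-1), hidden)
        action_logits = self.policy_head(hidden)
        msg = torch.tanh(self.message_head(hidden))  # fed back at the next step
        return action_logits, msg, hidden

agent = SelfTalkAgent()
B, T = 2, 6
obs_seq = torch.randn(T, B, 16)            # placeholder observations
goal_labels = torch.randint(0, 5, (T, B))  # placeholder semantic targets

msg = torch.zeros(B, 8)
hidden = torch.zeros(B, 32)
aux_loss = 0.0
for t in range(T):
    logits, msg, hidden = agent.step(obs_seq[t], msg, hidden)
    # Auxiliary "grounding" loss: the self-message must predict a semantic
    # variable, so it becomes a readable model of the agent's behavior.
    aux_loss = aux_loss + F.cross_entropy(agent.decoder(msg), goal_labels[t])
# In full training this term would be added to the usual RL objective.
# At test time, overwriting `msg` with a chosen semantic embedding would act
# as the kind of control interface the abstract describes.
aux_loss.backward()
```

In this sketch the control interface follows directly from the architecture: because the agent conditions on its own previous message, injecting an externally chosen message steers behavior through the same channel the agent uses to explain itself.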