Contrastive Explanations for Reinforcement Learning via Embedded Self Predictions

Zhengxian Lin; Kin-Ho Lam; Alan Fern

Published: 12 Jan 2021. Last Modified: 05 May 2023. Venue: ICLR 2021 Oral. Readers: Everyone

Keywords: Explainable AI, Deep Reinforcement Learning

Abstract: We investigate a deep reinforcement learning (RL) architecture that supports explaining why a learned agent prefers one action over another. The key idea is to learn action-values that are directly represented via human-understandable properties of expected futures. This is realized via the embedded self-prediction (ESP) model, which learns these properties in terms of human-provided features. Action preferences can then be explained by contrasting the future properties predicted for each action. To address cases with a large number of features, we develop a novel method for computing minimal sufficient explanations from an ESP model. Our case studies in three domains, including a complex strategy game, show that ESP models can be learned effectively and support insightful explanations.
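The idea of explaining a preference by contrasting predicted future properties can be illustrated with a minimal sketch. It assumes, purely for simplicity, that the model combines the predicted feature values linearly into a Q-value, so each feature's contribution to Q(s,a) - Q(s,b) is its weight times the difference between the two predictions. The selection rule is a simplified reading of a "minimal sufficient explanation": the smallest set of positive contributions whose total outweighs all negative contributions. It is not necessarily the paper's exact definition. The function names, feature names, and numbers are hypothetical.

```python
from typing import Dict, List


def contrast_attributions(
    gvf_a: Dict[str, float],
    gvf_b: Dict[str, float],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Per-feature contribution to Q(s,a) - Q(s,b).

    Assumes (hypothetically) a linear combiner Q(s,x) = sum_f weights[f] * gvf_x[f],
    so the Q-value difference splits exactly into one term per feature.
    """
    return {f: weights[f] * (gvf_a[f] - gvf_b[f]) for f in weights}


def minimal_sufficient_explanation(attr: Dict[str, float]) -> List[str]:
    """Smallest set of positive contributions whose sum exceeds all negative ones.

    Taking the largest positive contributions first yields a set of minimum size.
    If the positives never exceed the negatives, all positives are returned.
    """
    negative_total = -sum(v for v in attr.values() if v < 0)
    positives = sorted(
        ((f, v) for f, v in attr.items() if v > 0),
        key=lambda item: item[1],
        reverse=True,
    )
    chosen: List[str] = []
    running = 0.0
    for feature, value in positives:
        if running > negative_total:
            break  # already sufficient; stop adding reasons
        chosen.append(feature)
        running += value
    return chosen


# Hypothetical predicted future properties for two actions a and b.
gvf_a = {"enemy_damage": 8.0, "self_damage": 3.0, "cost": 2.0}
gvf_b = {"enemy_damage": 5.0, "self_damage": 1.0, "cost": 1.0}
weights = {"enemy_damage": 1.0, "self_damage": -1.0, "cost": -0.5}

attr = contrast_attributions(gvf_a, gvf_b, weights)
print(attr)  # {'enemy_damage': 3.0, 'self_damage': -2.0, 'cost': -0.5}
print(minimal_sufficient_explanation(attr))  # ['enemy_damage']
```

In this toy case, action a is preferred (the contributions sum to 0.5) because the extra expected enemy damage (+3.0) alone outweighs the higher expected self-damage and cost (2.5 combined). The single feature is therefore a sufficient explanation.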

One-sentence Summary: We introduce the embedded self-prediction (ESP) model for producing meaningful and sound contrastive explanations for RL agents.

Supplementary Material: zip

Code: [SuerpX/Embedded-Self-Predictions](https://github.com/SuerpX/Embedded-Self-Predictions)