Self Reward Design with Fine-grained Interpretability

Published: 28 Jan 2022, Last Modified: 22 Oct 2023 · ICLR 2022 Submitted · Readers: Everyone
Keywords: Reinforcement Learning, Interpretability, Explainable Artificial Intelligence, Neural Networks
Abstract: Transparency and fairness issues in Deep Reinforcement Learning may stem from the black-box nature of the deep neural networks used to learn its policy, value functions, etc. This paper proposes a way to circumvent these issues through the bottom-up design of neural networks (NN) with fine-grained interpretability, where each neuron or layer has its own meaning and utility corresponding to a humanly understandable concept. With deliberate design, we show that lavaland problems can be solved using an NN model with few parameters. Furthermore, we introduce Self Reward Design (SRD), inspired by Inverse Reward Design, so that our interpretable design can (1) solve the problem by pure design (although imperfectly), (2) be optimized via SRD, and (3) avoid unknown states by recognizing the inactivations of neurons, aggregated as the activation in \(w_{unknown}\).
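The \(w_{unknown}\) mechanism described in the abstract can be illustrated with a minimal sketch. This is a hypothetical toy example, not the paper's actual architecture: hand-designed prototype weights detect known lavaland tile types, and an "unknown" unit fires precisely when every known-type neuron stays inactive.

```python
import numpy as np

# Hypothetical illustration (assumed names and shapes, not the paper's code):
# each row of `known_prototypes` is a hand-designed detector for one
# known tile type (e.g. grass, dirt, target).
known_prototypes = np.eye(3)  # row k detects tile type k

def forward(tile_feature):
    # Activations of the interpretable "known type" neurons.
    known_act = known_prototypes @ tile_feature          # shape (3,)
    # w_unknown aggregates the *inactivation* of the known neurons:
    # it is high exactly when no known detector fires.
    w_unknown = max(0.0, 1.0 - known_act.sum())
    return known_act, w_unknown

grass  = np.array([1.0, 0.0, 0.0])   # a known tile type
novel  = np.array([0.0, 0.0, 0.0])   # an unseen tile: no known feature fires

_, u_grass = forward(grass)   # known tile  -> w_unknown = 0.0
_, u_novel = forward(novel)   # unseen tile -> w_unknown = 1.0
```

An agent can then treat a high \(w_{unknown}\) activation as a signal to avoid the state, which is the avoidance behavior point (3) of the abstract refers to.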
One-sentence Summary: Reinforcement Learning with fine-grained Interpretable Neural Network Designs and Self Reward Design
Supplementary Material: zip
Community Implementations: [3 code implementations](https://www.catalyzex.com/paper/arxiv:2112.15034/code)
18 Replies