Self Reward Design with Fine-grained Interpretability


Sep 29, 2021 (edited Oct 04, 2021) · ICLR 2022 Conference Blind Submission
  • Keywords: Reinforcement Learning, Interpretability, Explainable Artificial Intelligence, Neural Networks
  • Abstract: Transparency and fairness issues in Deep Reinforcement Learning may stem from the black-box nature of deep neural networks used to learn its policy, value functions etc. This paper proposes a way to circumvent the issues through the bottom-up design of neural networks (NN) with detailed interpretability, where each neuron or layer has its own meaning and utility that corresponds to humanly understandable concept. With deliberate design, we show that lavaland problems can be solved using NN model with few parameters. Furthermore, we introduce the Self Reward Design (SRD), inspired by the Inverse Reward Design, so that our interpretable design can (1) solve the problem by pure design (although imperfectly) (2) be optimized via SRD (3) perform avoidance of unknown states by recognizing the inactivations of neurons aggregated as the activation in \(w_{unknown}\).
  • One-sentence Summary: Reinforcement Learning with fine-grained Interpretable Neural Network Designs and Self Reward Design
  • Supplementary Material: zip
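
The unknown-state mechanism described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the detector matrix `W_known`, the terrain encodings, and the `unknown_activation` function are all hypothetical, assuming only the abstract's idea that \(w_{unknown}\) aggregates the *inactivation* of hand-designed detectors for known states.

```python
import numpy as np

# Hypothetical sketch: hand-designed detectors for known lavaland cell types.
# Each row of W_known is a prototype one-hot encoding of a known terrain type
# (e.g. grass, dirt, target); these names and shapes are assumptions.
W_known = np.eye(3)

def unknown_activation(cell, threshold=0.5):
    """Aggregate the inactivation of the known-type detectors:
    w_unknown fires only when every known detector is inactive."""
    known_acts = W_known @ cell        # activation of each known detector
    inactivation = 1.0 - known_acts    # per-detector inactivation
    # If even the best-matching detector is mostly inactive, flag as unknown.
    return float(np.min(inactivation) > threshold)

grass = np.array([1.0, 0.0, 0.0])   # known terrain: one detector fires
novel = np.array([0.0, 0.0, 0.0])   # unseen terrain: no detector fires
print(unknown_activation(grass))    # 0.0 -> known state
print(unknown_activation(novel))    # 1.0 -> unknown state, to be avoided
```

An agent could then route this scalar into its reward or action-selection machinery so that states activating \(w_{unknown}\) are avoided, which is the behavior the abstract attributes to the design.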