RL-LIM: Reinforcement Learning-based Locally Interpretable Modeling

Jinsung Yoon; Sercan O. Arik; Tomas Pfister

25 Sept 2019 (modified: 22 Jun 2025) | ICLR 2020 Conference Blind Submission | Readers: Everyone

Keywords: Interpretability, Explainable AI, Explainability

Abstract: Understanding black-box machine learning models is important for their widespread adoption. However, developing globally interpretable models that explain the behavior of the entire model is challenging. An alternative approach is to explain black-box models by explaining individual predictions using a locally interpretable model. In this paper, we propose a novel method for locally interpretable modeling: Reinforcement Learning-based Locally Interpretable Modeling (RL-LIM). RL-LIM employs reinforcement learning to select a small number of samples and distill the black-box model's predictions into a low-capacity locally interpretable model. Training is guided by a reward obtained directly by measuring the agreement between the predictions of the locally interpretable model and those of the black-box model. RL-LIM nearly matches the overall prediction performance of black-box models while yielding human-like interpretability, and significantly outperforms state-of-the-art locally interpretable models in terms of overall prediction performance and fidelity.
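To make the abstract's pipeline concrete, here is a minimal sketch of two of its pieces: fitting a low-capacity local model on black-box predictions over a selected subset of samples, and scoring it by agreement with the black box at a probe point. It is not the authors' implementation. The function names (`black_box`, `fit_local_linear`, `fidelity_reward`), the toy black box, and the kernel bandwidth are all assumptions made for illustration. In particular, the learned selection policy is replaced here by Bernoulli sampling from a fixed distance kernel, so no reinforcement learning takes place in this example.

```python
import numpy as np


def black_box(X):
    # Hypothetical stand-in for a trained black-box model (assumption).
    return np.sin(X[:, 0]) + X[:, 1] ** 2


def fit_local_linear(X, y, weights, ridge=1e-3):
    """Weighted ridge regression with an intercept; returns [bias, w1, w2, ...]."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    A = Xb.T @ (weights[:, None] * Xb) + ridge * np.eye(Xb.shape[1])
    b = Xb.T @ (weights * y)
    return np.linalg.solve(A, b)


def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]


def fidelity_reward(coef, probe, target):
    """Negative absolute disagreement with the black box at the probe point."""
    return -abs(predict_linear(coef, probe[None, :])[0] - target)


rng = np.random.default_rng(0)
X_train = rng.uniform(-2.0, 2.0, size=(500, 2))
y_bb = black_box(X_train)  # distillation targets: black-box outputs, not labels

probe = np.array([1.0, 0.5])
target = black_box(probe[None, :])[0]

# Baseline: every training sample is used with equal weight.
coef_uniform = fit_local_linear(X_train, y_bb, np.ones(len(X_train)))

# Stand-in for a learned selector (assumption): selection probabilities come
# from a fixed distance kernel around the probe, then a binary subset is drawn.
probs = np.exp(-np.sum((X_train - probe) ** 2, axis=1) / 0.5)
selected = rng.binomial(1, probs).astype(float)
coef_local = fit_local_linear(X_train, y_bb, selected)

reward_uniform = fidelity_reward(coef_uniform, probe, target)
reward_local = fidelity_reward(coef_local, probe, target)
print(f"selected {int(selected.sum())} of {len(X_train)} samples")
print(f"reward (uniform): {reward_uniform:.3f}, reward (local): {reward_local:.3f}")
```

In a setup closer to the one the abstract describes, the fixed kernel would be replaced by a trainable selector whose parameters are updated using a reward of this kind. The coefficients of the fitted linear model then serve as the local explanation for the probe point.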

Code: https://drive.google.com/open?id=1WpjqBHoYyF2W8vSZMjVgzgB1rlTsJq99

Community Implementations: [3 code implementations](https://www.catalyzex.com/paper/rl-lim-reinforcement-learning-based-locally/code)
