Interpreting a deep reinforcement learning model with conceptual embedding and performance analysis

Published: 01 Jan 2023, Last Modified: 24 May 2025. Appl. Intell. 2023. License: CC BY-SA 4.0
Abstract: The weak interpretability of deep reinforcement learning (DRL) models is a serious impediment to the deployment of DRL agents in areas that require high reliability. To interpret the behavior of a DRL agent, researchers use saliency maps to discover the parts of the agent's observation that most influence its decision. However, saliency maps still cannot explicitly present the cause and effect between an agent's actions and its observations. In this paper, we analyze the inference procedure with respect to the DRL architecture and propose embedding interpretable intermediate representations into an agent's policy; these intermediate representations are compressed and abstracted for explanation. We use a conceptual embedding technique to regulate the latent representation space of the deep model so that it produces interpretable causal factors aligned with human concepts. Furthermore, the information loss of the intermediate representation is analyzed to define the model's performance upper bound and to measure its performance degradation. Experiments validate the effectiveness of the proposed method and the relationship between the observation information and an agent's performance upper bound.
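To give a rough sense of what "embedding interpretable intermediate representations into an agent's policy" can look like, the sketch below shows a generic concept-bottleneck policy in PyTorch: observations are compressed into a small concept vector, and an auxiliary alignment loss nudges those concepts toward human-labelled factors so that actions can be traced back to interpretable causes. This is not the paper's implementation; all class names, dimensions, and the loss weight `alpha` are illustrative assumptions.

```python
# Hypothetical sketch (not from the paper): a policy network whose decision
# flows through a small, interpretable concept layer.
import torch
import torch.nn as nn


class ConceptBottleneckPolicy(nn.Module):
    def __init__(self, obs_dim: int, n_concepts: int, n_actions: int):
        super().__init__()
        # Encoder: observation -> compressed concept vector in [0, 1].
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_concepts), nn.Sigmoid(),
        )
        # Policy head: the action depends only on the concept vector.
        self.policy_head = nn.Linear(n_concepts, n_actions)

    def forward(self, obs: torch.Tensor):
        concepts = self.encoder(obs)          # interpretable intermediate representation
        logits = self.policy_head(concepts)   # decision based solely on concepts
        return logits, concepts


def loss_fn(logits, concepts, action_targets, concept_labels, alpha=0.5):
    """Task loss plus a concept-alignment term that regularises the latent
    space toward human-defined concepts (alpha is an assumed weighting)."""
    task_loss = nn.functional.cross_entropy(logits, action_targets)
    concept_loss = nn.functional.binary_cross_entropy(concepts, concept_labels)
    return task_loss + alpha * concept_loss


# Toy usage with random tensors, only to show the shapes involved.
model = ConceptBottleneckPolicy(obs_dim=16, n_concepts=4, n_actions=3)
obs = torch.randn(8, 16)
logits, concepts = model(obs)
loss = loss_fn(
    logits, concepts,
    action_targets=torch.randint(0, 3, (8,)),
    concept_labels=torch.rand(8, 4),
)
loss.backward()
```

Because the policy head sees only the compressed concept vector, any information discarded by the bottleneck is unavailable to the agent, which is the intuition behind analyzing the intermediate representation's information loss as an upper bound on performance.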