This State Looks Like That: Self-Interpretable Reinforcement Learning Agents using Prototype Soft Actor-Critic
Keywords: explainable ai, reinforcement learning, interpretability, self explainable, soft actor critic
Abstract: Reinforcement learning (RL) has achieved remarkable success across complex decision-making tasks, especially with the advent of deep neural networks. However, the resulting models are often opaque, making their deployment in safety-critical domains challenging. Explainable AI aims to address this issue, but most efforts specific to deep RL remain limited either to post-hoc explanation methods or to imitation learning and distillation procedures. These latter approaches rely on pre-trained black-box agents and are typically restricted to environments with discrete action spaces, limiting their scalability and interpretability. In this paper, we introduce ProtoSAC, a novel deep RL architecture that integrates a prototype-based actor into the Soft Actor-Critic (SAC) algorithm, enabling intrinsic interpretability in continuous action spaces. Our method learns a set of prototypes that represent interpretable state clusters, each associated with a Gaussian action distribution. Actions are generated as a similarity-weighted mixture over these prototypes, providing transparent decision-making without sacrificing performance. We evaluate ProtoSAC on continuous action-space environments and show that it matches the performance of the original SAC while offering enhanced interpretability.
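The abstract's core mechanism — each prototype paired with a Gaussian action distribution, combined by state-to-prototype similarity — can be sketched as follows. This is a minimal illustration of the idea only, not the paper's implementation; all names (`prototypes`, `proto_means`, `temperature`, the softmax-over-negative-distances similarity) and the dimensions are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and learned parameters (randomly initialized here;
# in ProtoSAC these would be trained jointly with the SAC objective).
state_dim, action_dim, n_prototypes = 4, 2, 5
prototypes = rng.normal(size=(n_prototypes, state_dim))    # interpretable state prototypes
proto_means = rng.normal(size=(n_prototypes, action_dim))  # per-prototype Gaussian mean
proto_log_stds = np.zeros((n_prototypes, action_dim))      # per-prototype Gaussian log-std
temperature = 1.0  # assumed similarity temperature

def prototype_policy(state):
    """Sample an action from a similarity-weighted mixture over prototypes."""
    # Similarity: softmax over negative squared distances to each prototype.
    sq_dists = ((prototypes - state) ** 2).sum(axis=1)
    logits = -sq_dists / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Blend the per-prototype Gaussian parameters by similarity, then sample.
    mean = weights @ proto_means
    std = np.exp(weights @ proto_log_stds)
    action = mean + std * rng.normal(size=action_dim)
    return action, weights  # weights reveal which prototypes drove the action

state = rng.normal(size=state_dim)
action, weights = prototype_policy(state)
```

The returned `weights` are what make the decision inspectable: an action can be explained by pointing at the few prototype states ("this state looks like that") with the largest weights.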
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 6999