Keywords: Interpretability, Reinforcement Learning, Clustering, Semantics, VAE
Abstract: In this paper, we explore the semantic clustering properties of deep reinforcement learning (DRL) to improve its interpretability and deepen our understanding of its internal semantic organization. In this context, semantic clustering refers to the ability of neural networks to group inputs by semantic similarity in their internal feature space. We propose a DRL architecture that incorporates a novel semantic clustering module combining feature dimensionality reduction with online clustering. The module integrates seamlessly into the DRL training pipeline, addressing the instability of t-SNE and eliminating the extensive manual annotation required by previous semantic analysis methods. Through experiments, we validate the effectiveness of the proposed module and demonstrate its ability to reveal semantic clustering properties within DRL. Furthermore, we introduce new analytical methods that leverage these properties to provide insights into the hierarchical structure of policies and the semantic organization of the feature space. These methods also help identify potential risks within the model, offering a deeper understanding of its limitations and guiding future improvements.
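The abstract does not specify the module's implementation, so the following is only a minimal sketch of the two pieces it names: feature dimensionality reduction (here a VAE-style encoder, given the "VAE" keyword) and online clustering (here mini-batch centroid updates via an exponential moving average, a common online variant of k-means). The class name, architecture, and all hyperparameters are assumptions for illustration, not the authors' method.

```python
# Hedged sketch of a semantic clustering module that could sit inside a DRL
# training loop. All names and hyperparameters below are hypothetical.
import torch
import torch.nn as nn


class SemanticClusteringModule(nn.Module):
    def __init__(self, feat_dim=256, latent_dim=32, n_clusters=16, momentum=0.99):
        super().__init__()
        # VAE-style encoder head: reduces DRL features to a low-dimensional latent.
        self.fc_mu = nn.Linear(feat_dim, latent_dim)
        self.fc_logvar = nn.Linear(feat_dim, latent_dim)
        # Online cluster centroids, updated with an exponential moving average.
        self.register_buffer("centroids", torch.randn(n_clusters, latent_dim))
        self.momentum = momentum

    def encode(self, features):
        mu, logvar = self.fc_mu(features), self.fc_logvar(features)
        # Reparameterization trick, as in a standard VAE.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

    @torch.no_grad()
    def update_centroids(self, z, assignments):
        # EMA update: move each centroid toward the mean of its assigned latents.
        for k in range(self.centroids.size(0)):
            members = z[assignments == k]
            if members.numel() > 0:
                self.centroids[k].mul_(self.momentum).add_(
                    members.mean(dim=0), alpha=1.0 - self.momentum)

    def forward(self, features):
        z, mu, logvar = self.encode(features)
        # Online clustering step: assign each latent to its nearest centroid.
        dists = torch.cdist(z, self.centroids)
        assignments = dists.argmin(dim=1)
        self.update_centroids(z.detach(), assignments)
        # KL term regularizes the latent space; the cluster term pulls latents
        # toward their centroids. Both would be added to the DRL loss.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        cluster_loss = dists.gather(1, assignments.unsqueeze(1)).mean()
        return assignments, kl + cluster_loss


# Example usage inside a training step, with `features` taken from the agent's
# feature extractor (a hypothetical hook point):
# assignments, aux_loss = module(features)
# total_loss = rl_loss + aux_weight * aux_loss
```

Because the centroids are updated online from mini-batches, this avoids rerunning a post-hoc embedding such as t-SNE (whose stochastic layout varies across runs) and requires no manual labels, which is consistent with the motivation stated in the abstract.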
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5216