Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning

Published: 01 Mar 2026, Last Modified: 24 Apr 2026, Venue: ICLR 2026 AIWILD, Readers: Everyone, License: CC BY 4.0
Keywords: agents, interpretability, trustworthy ai, large language models, multi-agent systems, long-horizon training, sparse autoencoders, LLM-as-a-judge, reinforcement learning, strategy games
TL;DR: We combine sparse autoencoders and LLM judges to interpret emergent behavior in long-horizon multi-agent LLM training.
Abstract: Large language model (LLM) agents are deployed in increasingly complex multi-agent environments, making it critical to detect unsafe or unintended behaviors during training. Sparse Autoencoders (SAEs) have recently been shown to be useful for data-centric interpretability. In this work, we analyze large-scale reinforcement learning training runs from the complex environment of Full-Press Diplomacy by applying pretrained SAEs alongside LLM-summarizer methods. We introduce Meta-Autointerp, a method for grouping SAE features into interpretable hypotheses about training dynamics. We discover fine-grained behaviors, including role-playing patterns, degenerate outputs, and language switching, alongside high-level strategic behaviors and environment-specific bugs. Through automated evaluation, we validate that 90% of discovered SAE Meta-Features are significant, and we discover a novel reward hacking pattern in which incentivized behaviors generalize into structurally similar but unrewarded actions. Through two user studies, however, we find that even subjectively interesting and seemingly helpful SAE features may be worse than useless to humans, as may most LLM-generated hypotheses. Nevertheless, a subset of SAE-derived hypotheses is predictively useful for downstream tasks. We provide further validation by augmenting an untrained agent's system prompt, which improves its score by 14.2%. Overall, we show that SAEs and LLM-summarizer methods provide complementary views into agent behavior, and together our framework forms a practical starting point for future data-centric interpretability work on ensuring trustworthy LLM behavior throughout training.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 87