OpenMAIA: a Multimodal Automated Interpretability Agent based on open-source models

Published: 30 Sept 2025 · Last Modified: 30 Sept 2025 · Mech Interp Workshop (NeurIPS 2025) Poster · CC BY 4.0
Keywords: Automated interpretability, Causal interventions, Interpretability tooling and software
Other Keywords: Neuron Explanations, Agent-Based Interpretability, Multimodal LLMs, Open-Source Models, Reproducibility
TL;DR: OpenMAIA brings agent-based neuron interpretability to open multimodal LLMs with competitive accuracy and full reproducibility.
Abstract: Interpreting the internal mechanisms of large neural networks remains a central challenge for trustworthy AI. Recent work such as MAIA (a Multimodal Automated Interpretability Agent) has shown that agent-based systems can iteratively generate and test hypotheses about neuron function without human intervention, offering a scalable approach to mechanistic interpretability. However, these agent-based systems rely on closed-source APIs, limiting reproducibility and access. To address this, we introduce OpenMAIA, an open-source implementation of MAIA that replaces its main components with open-source models. Specifically, we experiment with two state-of-the-art multimodal Large Language Models (LLMs), Gemma-3-27B and Mistral-Small-3.2-24B, as backbone models, and update the agent's interpretability toolset with open-source models. Following the neuron description evaluation protocol established in the original MAIA paper, applied across multiple vision backbones and synthetic neurons, OpenMAIA achieves predictive accuracy comparable to Claude Sonnet 4 while converging more efficiently. These results demonstrate that competitive agent-based interpretability can be achieved with a fully open stack, providing a practical and reproducible foundation for community-driven research.
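To make the workflow concrete, below is a minimal Python sketch of the hypothesize-and-test loop the abstract describes: a multimodal LLM backbone proposes experiments, an open-source tool executes them, and the agent refines its natural-language neuron description until convergence. All names here (`backbone`, `tools`, `neuron`, `propose_experiment`, `update_description`) are hypothetical placeholders for illustration, not the actual OpenMAIA or MAIA API.

```python
# Illustrative sketch of an agent-based neuron-interpretation loop.
# Interfaces are assumed, not taken from the OpenMAIA codebase.
from dataclasses import dataclass, field

@dataclass
class Experiment:
    prompt: str         # instruction sent to a tool, e.g. an image to synthesize
    activation: float   # measured response of the neuron under study

@dataclass
class AgentState:
    history: list[Experiment] = field(default_factory=list)  # past experiments
    description: str = ""                                    # current hypothesis

def interpret_neuron(backbone, tools, neuron, max_rounds: int = 10) -> str:
    """Iteratively generate and test hypotheses about a neuron's function."""
    state = AgentState()
    for _ in range(max_rounds):
        # 1. The multimodal LLM proposes the next experiment (e.g. an image
        #    to generate or edit) given everything observed so far.
        plan = backbone.propose_experiment(neuron, state.history)
        # 2. An open-source tool (e.g. a text-to-image model) runs the
        #    experiment; the neuron's activation on the result is recorded.
        image = tools.run(plan)
        state.history.append(Experiment(plan, neuron.activation(image)))
        # 3. The backbone updates its natural-language description of the
        #    neuron and signals whether the evidence is sufficient to stop.
        state.description, done = backbone.update_description(state.history)
        if done:
            break
    return state.description
```

The loop bounds the number of tool calls with `max_rounds`; the paper's efficiency claim (converging in fewer iterations than Claude Sonnet 4) would correspond to the `done` signal firing earlier under this kind of budget.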
Submission Number: 169