OpenMAIA: a Multimodal Automated Interpretability Agent based on open-source models

Published: 30 Sept 2025, Last Modified: 08 Nov 2025
Venue: Mech Interp Workshop (NeurIPS 2025) Poster
License: CC BY 4.0
Open Source Links: https://github.com/multimodal-interpretability/maia
Keywords: Automated interpretability, Causal interventions, Interpretability tooling and software
Other Keywords: Neuron Explanations, Agent-Based Interpretability, Multimodal LLMs, Open-Source Models, Reproducibility
TL;DR: OpenMAIA brings agent-based neuron interpretability to open multimodal LLMs with competitive accuracy and full reproducibility.
Abstract: Understanding how large neural networks represent and transform information remains a major obstacle to achieving transparent AI systems. Recent work such as MAIA (a Multimodal Automated Interpretability Agent) has shown that agent-based systems can iteratively generate and test hypotheses about neuron function without human intervention, offering a scalable approach to mechanistic interpretability. However, existing agent-based systems rely on closed-source APIs, limiting reproducibility and access. To address this, we introduce OpenMAIA, an open-source implementation of MAIA that replaces its closed-source API-based components with open-source models. We experiment with two state-of-the-art multimodal large language models (LLMs), Gemma-3-27B and Mistral-Small-3.2-24B, as OpenMAIA backbone models, and update the agent's interpretability toolset with open-source models. Following the neuron description evaluation protocol established in the original MAIA paper, which covers neurons from several vision backbones as well as synthetic neurons, we show that OpenMAIA with an open-source backbone achieves performance comparable to the same OpenMAIA configuration with Claude-Sonnet-4 as its backbone. In addition, the open-source configuration converges more efficiently than the Claude-Sonnet-4 one. These results demonstrate that competitive agent-based interpretability can be achieved with a fully open stack, providing a practical and reproducible foundation for community-driven research.
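To make the abstract's central mechanism concrete, the following is a minimal Python sketch of the kind of hypothesize-and-test loop described above: the multimodal backbone proposes a hypothesis about a neuron and a stimulus to probe it, the toolset runs the experiment, and the loop repeats until the backbone judges the evidence conclusive. Every name here (query_backbone, run_experiment, the reply format) is a hypothetical placeholder for illustration only, not the actual OpenMAIA interface, which lives in the linked repository.

```python
# Minimal sketch (hypothetical names throughout; not the actual OpenMAIA API)
# of an iterative hypothesize-and-test loop for neuron interpretation.

from dataclasses import dataclass


@dataclass
class Evidence:
    stimulus: str      # e.g. a text prompt for an open-source image generator
    activation: float  # the target neuron's activation on that stimulus


def query_backbone(prompt: str) -> str:
    """Placeholder for one call to an open multimodal backbone such as
    Gemma-3-27B or Mistral-Small-3.2-24B served locally."""
    raise NotImplementedError("connect a local model server here")


def run_experiment(stimulus: str, neuron_id: int) -> float:
    """Placeholder for the interpretability toolset: synthesize or edit an
    image from the stimulus and measure the target neuron's activation."""
    raise NotImplementedError("connect the vision backbone here")


def interpret_neuron(neuron_id: int, max_rounds: int = 5) -> str:
    evidence: list[Evidence] = []
    description = "(no hypothesis yet)"
    for _ in range(max_rounds):
        # 1. The backbone proposes a hypothesis and a stimulus to probe it,
        #    conditioned on all evidence gathered so far.
        reply = query_backbone(
            f"Neuron {neuron_id}, evidence so far: {evidence}. "
            "Reply as '<hypothesis> | <stimulus to test it>'."
        )
        description, stimulus = (part.strip() for part in reply.split("|", 1))
        # 2. The toolset runs the experiment and records the response.
        evidence.append(Evidence(stimulus, run_experiment(stimulus, neuron_id)))
        # 3. Stop once the backbone judges the evidence conclusive.
        verdict = query_backbone(f"Does {evidence} support '{description}'? yes/no")
        if verdict.strip().lower().startswith("yes"):
            break
    return description
```

Swapping the backbone behind query_backbone is the only change needed to move between a closed API and a locally served open model, which is the substitution the paper evaluates.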
Submission Number: 169