TL;DR: We propose SAE-V, a mechanistic interpretability framework for multimodal large language models that analyzes multimodal features and enhances their alignment.
Abstract: With the integration of the image modality, the semantic space of multimodal large language models (MLLMs) is more complex than that of text-only models, making them harder to interpret and less stable to align. They are particularly susceptible to low-quality data, which can cause inconsistencies between modalities, hallucinations, and biased outputs. Developing interpretability methods for MLLMs is therefore crucial for improving alignment quality and efficiency. In text-only LLMs, Sparse Autoencoders (SAEs) have gained attention for their ability to interpret latent representations. However, extending SAEs to multimodal settings presents new challenges due to modality fusion and the difficulty of isolating cross-modal representations. To address these challenges, we introduce SAE-V, a mechanistic interpretability framework that extends the SAE paradigm to MLLMs. By identifying and analyzing interpretable features along with their corresponding data, SAE-V enables fine-grained interpretation of both model behavior and data quality, facilitating a deeper understanding of cross-modal interactions and alignment dynamics. Moreover, by utilizing cross-modal feature weighting, SAE-V provides an intrinsic data filtering mechanism to enhance model alignment without requiring additional models. Specifically, when applied to the alignment process of MLLMs, SAE-V-based data filtering achieves more than 110% of full-data performance while using less than 50% of the data. Our results highlight SAE-V’s ability to enhance interpretability and alignment in MLLMs, providing insights into their internal mechanisms.
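To make the two ingredients in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of (a) a sparse autoencoder trained to reconstruct MLLM hidden states and (b) a cross-modal feature-weighting score used to rank training samples. The class and function names, the image-token mask, and the specific scoring rule are illustrative assumptions, not the released SAE-V implementation.

```python
# Minimal sketch (not the authors' implementation): an SAE over MLLM hidden
# states plus a hypothetical cross-modal feature-weighting score for filtering.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Standard SAE: reconstruct hidden states through a sparse latent code."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.encoder(h))   # sparse, interpretable feature activations
        h_hat = self.decoder(f)           # reconstruction of the hidden states
        return h_hat, f


def cross_modal_score(features: torch.Tensor, image_mask: torch.Tensor) -> float:
    """Hypothetical cross-modal weighting: score one sample by how strongly its
    SAE features co-activate on both image tokens and text tokens.

    features:   [seq_len, d_features] SAE activations for a single sample
    image_mask: [seq_len] bool tensor, True where the token comes from the image
    """
    img_act = features[image_mask].mean(dim=0)    # mean activation on image tokens
    txt_act = features[~image_mask].mean(dim=0)   # mean activation on text tokens
    # Features active in both modalities are taken as evidence of cross-modal structure.
    return torch.minimum(img_act, txt_act).sum().item()


def filter_top_fraction(scores: list[float], fraction: float = 0.5) -> list[int]:
    """Keep the indices of the highest-scoring samples (intrinsic data filtering)."""
    k = max(1, int(len(scores) * fraction))
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

Under this reading, samples whose SAE features fire on both image and text tokens rank highest, and only the top-scoring fraction of the dataset is retained for alignment training.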
Lay Summary: Modern AI systems that can understand both text and images (like GPT-4o with vision) are becoming increasingly powerful, but we don't fully understand how they work internally. This lack of transparency makes it difficult to ensure these systems behave safely and align with human values. We developed SAE-V, a tool that acts like an "X-ray machine" for these AI models, allowing us to peek inside and see what information they're focusing on.
Our tool works by breaking down the AI's internal computations into interpretable "features", such as recognizing when the model is thinking about a specific concept like "Doberman dogs" or an abstract idea like "symmetry". We discovered that SAE-V can identify which parts of the training data help the AI understand connections between images and text, versus data that confuses it.
Using this insight, we created a smart data filtering system. Just as a good teacher selects the best examples to help students learn, our method automatically identifies the highest-quality training examples. Remarkably, by keeping only the best quarter of the training data identified by SAE-V, we achieved better performance than using all the data, making AI training both more efficient and more effective. This work helps make AI systems more transparent and easier to improve.
Link To Code: https://github.com/PKU-Alignment/SAE-V
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: interpretability, alignment, multimodal large language model
Submission Number: 15602