Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering

Authors: Anonymous (ACL ARR 2025 February Submission 103)

Submitted: 03 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: Understanding the mechanisms behind Large Language Models (LLMs) is crucial for designing better models and training strategies. While recent studies have yielded valuable insights into the mechanisms of textual LLMs, the mechanisms of Multimodal Large Language Models (MLLMs) remain underexplored. In this paper, we apply mechanistic interpretability methods to analyze the visual question answering (VQA) mechanism in an MLLM, Llava. We compare the mechanisms of VQA and textual QA (TQA) on a color-answering task and find that: a) VQA exhibits a mechanism similar to the in-context learning mechanism observed in TQA; b) projecting the visual embeddings into the textual embedding space yields highly interpretable visual features; and c) Llava enhances, rather than replaces, the existing capabilities of its underlying textual LLM, Vicuna, during visual instruction tuning. Based on these findings, we develop an interpretability tool that helps users and researchers identify the visual locations most important for the final prediction, aiding in the understanding of visual hallucination. Our method is faster and more effective than existing interpretability approaches. Our code will be available on GitHub.
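Finding (b) describes reading projected visual embeddings through the language model's unembedding matrix, a logit-lens-style analysis. The sketch below illustrates that idea with the Hugging Face Llava implementation; the checkpoint name, the choice of vision feature layer, the CLS-token handling, and the image path are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch: project Llava's visual patch embeddings into the text
# embedding space and read them through the LM head (logit-lens style).
# Checkpoint, feature layer (-2), and CLS-token dropping are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # placeholder image path
pixel_values = processor.image_processor(image, return_tensors="pt")[
    "pixel_values"
].to(model.device, dtype=model.dtype)

with torch.no_grad():
    # CLIP vision tower; Llava conventionally takes the second-to-last
    # layer and drops the CLS token (both assumed here).
    vis_out = model.vision_tower(pixel_values, output_hidden_states=True)
    patch_feats = vis_out.hidden_states[-2][:, 1:, :]
    # Project the patch features into the LLM's embedding space.
    visual_embeds = model.multi_modal_projector(patch_feats)
    # "Logit lens": score each projected patch against the vocabulary.
    lm_head = model.get_output_embeddings()
    logits = lm_head(visual_embeds)  # (1, n_patches, vocab_size)

# Nearest vocabulary token per patch; print a sample of patches.
top_tokens = logits[0].argmax(dim=-1)
for patch_idx in range(0, top_tokens.shape[0], 50):
    token_str = processor.tokenizer.decode(top_tokens[patch_idx].item())
    print(f"patch {patch_idx:3d} -> {token_str!r}")
```

If the projection is interpretable, patches covering, say, a red object should decode to color- or object-related tokens; comparing these per-patch readouts against the model's final answer is one way to localize the visual evidence behind a prediction.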
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: mechanistic interpretability, multimodal LLM
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 103