Keywords: Vision–Language Models, Multimodal Training, Mechanistic Interpretability, Stage-wise Model Diffing, Sparse Autoencoders, Spatial Reasoning, Attention Heads
Abstract: Vision–Language Models (VLMs) demonstrate strong performance on a wide range of tasks by fine-tuning pretrained language backbones to process projected visual tokens alongside text. Yet despite these empirical gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation with stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning to reveal how a language model learns to "see". Concretely, we fine-tune sparse autoencoders, originally trained on LLaMA-3.1-8B, on multimodal activations from LLaVA-More (based on LLaMA-3.1-8B) using 50k VQAv2 pairs. We first isolate vision-preferring features that appear or reorient during multimodal fine-tuning. We then test for spatial selectivity using a controlled shift to spatial prompts and identify the attention heads that causally activate these units. Our findings show that stage-wise model diffing reveals when and how spatially grounded multimodal features arise.
It also provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language backbones acquire vision-grounded capabilities.
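A minimal sketch of the core stage-wise diffing step described above, not the paper's implementation: under simplifying assumptions, the sparse autoencoder is a plain ReLU encoder–decoder, `multimodal_acts` is a hypothetical tensor of residual-stream activations collected from the multimodally fine-tuned backbone, and all hyperparameters are illustrative. The idea is to continue training a pretrained SAE on the new activations and then compare decoder directions before and after to surface features that appear or reorient.

```python
# Sketch: stage-wise model diffing with a sparse autoencoder (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Simple ReLU SAE over residual-stream activations of size d_model."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        feats = F.relu(self.encoder(x))      # sparse feature activations
        return self.decoder(feats), feats

def finetune_sae(sae, acts, steps=1000, l1=1e-3, lr=1e-4, batch_size=256):
    """Continue training a pretrained SAE on new (e.g. multimodal) activations."""
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        batch = acts[torch.randint(len(acts), (batch_size,))]
        recon, feats = sae(batch)
        loss = F.mse_loss(recon, batch) + l1 * feats.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

@torch.no_grad()
def diff_features(sae_before, sae_after):
    """Score how much each feature's decoder direction rotated during fine-tuning.

    Returns 1 - cosine similarity per feature; high values flag features that
    reoriented (or effectively emerged) during multimodal training.
    """
    w0 = F.normalize(sae_before.decoder.weight.T, dim=-1)  # [d_features, d_model]
    w1 = F.normalize(sae_after.decoder.weight.T, dim=-1)
    return 1 - (w0 * w1).sum(dim=-1)
```

In this sketch, candidate vision-preferring features would be those with high rotation scores that also activate more strongly on image-token positions than on text-only positions; the paper's actual selection criteria and causal attention-head analysis are not reproduced here.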
Submission Number: 397