Keywords: Sparse Autoencoders, Vision Transformers, Causal Interventions
TL;DR: We apply stage-wise model diffing with sparse autoencoders to multimodal fine-tuning, showing how vision-language models develop spatially grounded features and identifying the attention heads that drive them.
Abstract: Contemporary Vision–Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model that is fine-tuned on visual–text inputs. Despite these gains, it remains unclear how the language backbone's representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of the VLM adaptation process. Using stage-wise model diffing, a technique that isolates the representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to "see". We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a select subset of these features reliably encodes spatial relations, as revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise. It also provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.
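To make the abstract's pipeline concrete, the following is a minimal sketch of the general idea behind sparse-autoencoder-based model diffing, not the paper's implementation: fit a simple SAE on residual-stream activations, then score each learned feature by how much more strongly it fires on vision-conditioned inputs than on text-only inputs. The SAE architecture, the synthetic activation tensors, and names such as `vision_preference_score` are assumptions for illustration only.

```python
# Illustrative sketch (not the paper's code): SAE-based feature diffing.
# Random tensors stand in for residual-stream activations collected at the
# same layer on text-only vs. image-conditioned prompts.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """One-hidden-layer SAE: x -> ReLU(W_enc x + b) -> W_dec h."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))
        recon = self.decoder(features)
        return recon, features


def train_sae(acts: torch.Tensor, d_features: int, l1: float = 1e-3,
              steps: int = 200, lr: float = 1e-3) -> SparseAutoencoder:
    """Fit an SAE with a reconstruction loss plus an L1 sparsity penalty."""
    sae = SparseAutoencoder(acts.shape[-1], d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        recon, feats = sae(acts)
        loss = (recon - acts).pow(2).mean() + l1 * feats.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae


def vision_preference_score(sae: SparseAutoencoder,
                            text_acts: torch.Tensor,
                            vision_acts: torch.Tensor) -> torch.Tensor:
    """Per-feature score: mean activation on vision-conditioned inputs minus
    mean activation on text-only inputs (higher = more vision-preferring)."""
    with torch.no_grad():
        _, f_text = sae(text_acts)
        _, f_vis = sae(vision_acts)
    return f_vis.mean(dim=0) - f_text.mean(dim=0)


if __name__ == "__main__":
    torch.manual_seed(0)
    d_model, d_features, n = 64, 256, 1024
    text_acts = torch.randn(n, d_model)
    vision_acts = torch.randn(n, d_model) + 0.5  # shifted to mimic modality drift
    sae = train_sae(torch.cat([text_acts, vision_acts]), d_features)
    scores = vision_preference_score(sae, text_acts, vision_acts)
    top = torch.topk(scores, k=5).indices
    print("Candidate vision-preferring feature indices:", top.tolist())
```

In the stage-wise setting described above, the same scoring would presumably be applied across checkpoints (base LM vs. multimodally fine-tuned backbone) to isolate features that emerge or reorient during fine-tuning, before tracing their activation to specific attention heads.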
Submission Number: 325