Keywords: Circuit Analysis, Attribution Graphs, Applications of interpretability, Interpretability for AI Safety
TL;DR: We present a mechanistic analysis of model compression in vision-language models.
Abstract: Compressed vision-language models (VLMs) are widely used to reduce memory and compute costs, making them a suitable choice for real-world deployment. However, compressing these models raises concerns about whether internal computations and safety behaviors are preserved. In this work, we use causal \textbf{circuit analysis} and \textbf{crosscoder}-based feature comparisons to examine how pruning and quantization alter internal representations across VLMs. We observe that pruning generally keeps circuit structure intact but \textit{rotates} and \textit{attenuates} internal features, while quantization modifies the circuits at a higher level yet leaves the surviving features better \textit{aligned}. We further evaluate these effects on refusal behavior in VLMs. Using a novel benchmark, \textbf{VLMSafe-420}, containing harmful prompts and benign counterfactuals across multiple safety categories and modalities, we show that pruning and quantization produce distinct degradations in genuine refusal behavior that reflect their underlying representational changes. Hence, the choice of compression method has important implications for AI safety.
Submission Number: 353