Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits

Logan Mann; Ajit Saravanan; Ishan Dave; Shikhar Shiromani; Saadullah Ismail; Yi Xia; Emily Huang

Published: 11 Jun 2026, Last Modified: 15 Jun 2026. Mech Interp Workshop ICML 2026 Virtual poster. License: CC BY 4.0

Keywords: Vision transformers, Probing, Causal interventions

Other Keywords: vision-language models, multimodal reliability, self-consistency, hidden-state probes

TL;DR: In vision-language models, attention-map structure is a weak signal of answer reliability; stronger reliability information appears later in hidden states, generation consistency, and family-specific causal circuits.

Abstract: Vision-language models can produce confident, fluent mistakes, but it is still unclear where their internal reliability signal actually lives. A natural hypothesis is that reliability should be visible in visual attention: sharper focus on the relevant region should imply a more trustworthy answer. We test this hypothesis with the VLM Reliability Probe (VRP), a cross-family study of LLaVA-1.5, PaliGemma, and Qwen2-VL that compares three classes of evidence: attention-map structure, generation dynamics, and hidden-state mechanisms. Our main claim is that attention structure is a poor reliability readout even when attention remains causally important for feature extraction: across the pooled structural-analysis set, cluster count $C_k$ and spatial entropy $H_s$ are nearly uncorrelated with correctness $y$ ($R(C_k,y)=0.001$, $R(H_s,y)=-0.012$). Instead, the strongest reliability signals emerge later in the computation. Self-consistency is the strongest behavioral predictor we measure ($R=0.429$), while hidden-state probes provide the best single-pass signal (AUROC $>0.95$ in our strongest settings). We further find a mechanistic split across model families: LLaVA exhibits early locking and a fragile late bottleneck, whereas PaliGemma and Qwen2-VL distribute reliability more broadly and remain robust under large interventions. The takeaway is narrow but important: in current VLMs, reliability is better understood through hidden-state geometry, layer-wise margin dynamics, and causal circuits than through attention-map sharpness alone.
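
To make the quantities in the abstract concrete, the following is a minimal sketch of how a spatial-entropy score, a self-consistency score, and their correlation with correctness could be computed. It is not the authors' implementation: the abstract does not define $H_s$, $R$, or the self-consistency measure, so this sketch assumes Shannon entropy over a normalized attention map, Pearson correlation against a 0/1 correctness label, and majority-agreement rate over sampled answers. The function names are hypothetical.

```python
import numpy as np
from collections import Counter


def spatial_entropy(attn_map):
    """Shannon entropy (in nats) of an attention map normalized to sum to 1.

    Assumption: this stands in for H_s; the paper may define it differently.
    """
    p = np.asarray(attn_map, dtype=float).ravel()
    p = p / p.sum()
    p = p[p > 0]  # drop zeros so that 0 * log(0) contributes nothing
    return float(-(p * np.log(p)).sum())


def self_consistency(answers):
    """Fraction of sampled answers that match the most common answer.

    Assumption: a simple majority-agreement rate over repeated generations.
    """
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def pearson_r(scores, correct):
    """Pearson correlation between a per-example score and 0/1 correctness."""
    x = np.asarray(scores, dtype=float)
    y = np.asarray(correct, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


if __name__ == "__main__":
    # Synthetic inputs only; these are not data from the paper.
    rng = np.random.default_rng(0)
    maps = rng.random((8, 24, 24))          # eight fake 24x24 attention maps
    correct = rng.integers(0, 2, size=8)    # fake correctness labels
    entropies = [spatial_entropy(m) for m in maps]
    print("R(H_s, y) on synthetic data:", pearson_r(entropies, correct))
    print("self-consistency:", self_consistency(["cat", "cat", "dog", "cat"]))
```

Under these assumptions, a near-zero `pearson_r` for entropy alongside a clearly positive one for self-consistency would correspond to the pattern the abstract reports.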

Submission Number: 54
