Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models

ICLR 2026 Conference Submission 14307 Authors

18 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: Interpretability, vision-language models, multimodal reasoning, chart understanding, mathematical reasoning
TL;DR: We leverage tools from mechanistic interpretability to investigate why three vision-language model architectures fail on basic data visualization understanding tasks.
Abstract: Data visualizations are vital components of many scientific articles and news stories. Current vision-language models (VLMs) still struggle on basic data visualization understanding tasks, but the causes of failure remain unclear. Are VLM failures attributable to limitations in how visual information in the data visualization is encoded, how information is transferred between the vision and language modules, or how information is processed within the language module? We developed $\texttt{FUGU}$, a suite of data visualization understanding tasks, to precisely characterize potential sources of difficulty (e.g., extracting the position of data points, distances between them, and other summary statistics). We used $\texttt{FUGU}$ to investigate three widely used VLMs (LLaMA-3.2, LLaVA-OneVision, and InternVL3). To diagnose the sources of errors produced by these models, we used activation patching and linear probes to trace information flow through these models across a variety of prompting strategies. We found that some models fail to generate the coordinates of individual data points correctly, and these initial errors often lead to erroneous final responses. When these models are provided with the correct coordinates, performance improves substantially, suggesting that the downstream mathematical reasoning steps performed in the language module are sound. Moreover, even when the model generates an incorrect response, the correct coordinates can be successfully read out from the latent representations in the vision encoder, suggesting that the source of these errors lies in the vision-language handoff. We further found that while providing correct coordinates helps with tasks involving one or a small number of data points, it generally worsens performance for tasks that require extracting statistical relationships across many data points (e.g., correlations, clusters). Fine-tuning models on $\texttt{FUGU}$ also fails to yield ceiling performance. These findings point to fundamental architectural constraints in current VLMs that might pose significant challenges for reliable data visualization understanding.
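The abstract describes using linear probes to test whether data-point coordinates can be read out of the vision encoder's latent representations even when the model's final answer is wrong. The sketch below illustrates that general idea with a ridge-regression probe on pooled encoder features; the function name, feature shapes, and synthetic data are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of a linear coordinate probe, assuming pooled vision-encoder
# activations per chart and ground-truth (x, y) coordinates of a queried point.
# All names and shapes here are hypothetical stand-ins, not FUGU code.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


def fit_coordinate_probe(features: np.ndarray, coords: np.ndarray):
    """features: (n_charts, d) pooled vision-encoder activations.
    coords:   (n_charts, 2) ground-truth (x, y) of the queried data point."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, coords, test_size=0.2, random_state=0
    )
    probe = Ridge(alpha=1.0).fit(X_train, y_train)
    # High held-out R^2 suggests coordinates are linearly decodable from the encoder.
    r2 = probe.score(X_test, y_test)
    return probe, r2


# Synthetic example, purely to show the interface:
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 1024))                      # stand-in encoder features
true_coords = feats @ rng.normal(size=(1024, 2)) * 0.05   # toy linear structure
true_coords += 0.1 * rng.normal(size=(500, 2))            # plus noise
probe, r2 = fit_coordinate_probe(feats, true_coords)
print(f"held-out R^2: {r2:.3f}")
```

In the paper's setting, a high probe R^2 alongside an incorrect final answer is what localizes the failure to the vision-language handoff rather than the encoder itself.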
Primary Area: interpretability and explainable AI
Submission Number: 14307