ChartRef: Benchmarking Fine-Grained Visual Element Localization in Charts

12 Sept 2025 (modified: 17 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: chart perception, chart object detection, chart visual grounding, synthetic data generation
TL;DR: We propose a scalable data generation pipeline to collect ChartRef, a benchmark with paired questions, answers, and bounding boxes of chart visual elements, and evaluate the capabilities of foundation models on fine-grained chart visual grounding.
Abstract: Humans interpret charts by first localizing visual elements—such as bars, markers, and segments—before reasoning over the data. In contrast, current multimodal models rely primarily on text reasoning, limiting their ability to leverage fine-grained visual information. To address this, we introduce ChartRef, a dataset of 38,846 questions, answers, referential expressions, and bounding boxes across 1,141 figures and 11 chart types. Our key insight is that the chart-rendering code makes it possible to generate visual element localizations that are aligned with question–answer pairs. Given only the Python script, a large language model infers the semantics of the plotted data, maps data series to visual encodings, and programmatically extracts bounding boxes, yielding visual annotations for charts at scale. Using ChartRef, we benchmark multimodal LLMs and find 3–7% accuracy improvements on chart question answering when models are provided with ground-truth bounding boxes. We further evaluate vision and multimodal models on chart object detection and visual grounding. While object detection exceeds 80 AP@50, phrase grounding accuracy is only 2.8%, revealing a significant gap: current models can recognize chart elements perceptually but struggle to integrate contextual cues from axes, legends, labels, and data to ground fine-grained textual references. We hope to inspire future work on multimodal models capable of human-like chart visual grounding.
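The abstract's claim that bounding boxes can be extracted "programmatically" from the rendering code can be illustrated with a minimal matplotlib sketch. This is not the authors' pipeline, only an assumed mechanism: after a figure is drawn, each plotted Artist (e.g., a bar's Rectangle patch) exposes its pixel-space extent, which can be paired with the data label to form a grounding annotation.

```python
import matplotlib
matplotlib.use("Agg")  # render headlessly, no display needed
import matplotlib.pyplot as plt

# Hypothetical extraction step: render a simple bar chart, then read each
# bar's pixel-space bounding box directly from its Rectangle patch.
labels, values = ["a", "b", "c"], [3, 1, 2]
fig, ax = plt.subplots()
bars = ax.bar(labels, values)
fig.canvas.draw()  # extents are only valid after the figure is drawn

boxes = []
for label, patch in zip(labels, bars):
    bbox = patch.get_window_extent()  # display (pixel) coordinates
    boxes.append((label, [bbox.x0, bbox.y0, bbox.x1, bbox.y1]))
```

Pairing these boxes with the series semantics inferred by an LLM (which bar encodes which category) would yield the kind of question–answer–box triples the dataset describes.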
Primary Area: datasets and benchmarks
Submission Number: 4436