ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

Published: 18 Sept 2025, Last Modified: 30 Oct 2025, NeurIPS 2025 Datasets and Benchmarks Track poster, CC BY 4.0
Keywords: chart understanding, LVLM, reasoning
Abstract: Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce *ChartMuseum*, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks---where frontier models perform similarly and near saturation---our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model, Gemini-2.5-Pro, attains only 63.0%, and the leading open-source LVLM, Qwen2.5-VL-72B-Instruct, achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, *all* models experience a 35%-55% performance drop relative to their performance on text-reasoning-heavy questions. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs. Both ChartMuseum and the evaluation code are available at [https://github.com/Liyan06/ChartMuseum](https://github.com/Liyan06/ChartMuseum).
Croissant File: json
Dataset URL: https://huggingface.co/datasets/lytang/ChartMuseum
Code URL: https://github.com/Liyan06/ChartMuseum
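
A minimal sketch of loading the dataset from the Hugging Face Hub listed above. This is not taken from the paper or its repository; the split and field names are assumptions, so the snippet only inspects whatever schema the released dataset actually exposes.

```python
# Sketch: load ChartMuseum from the Hugging Face Hub and inspect its schema.
# The dataset ID comes from the Dataset URL above; split/field names are not
# assumed here and are simply printed for inspection.
from datasets import load_dataset

dataset = load_dataset("lytang/ChartMuseum")

# Report the available splits, their sizes, and the keys of one example.
for split_name, split in dataset.items():
    print(f"{split_name}: {len(split)} examples")
    print("fields:", list(split[0].keys()))
```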
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 2133