MultiChartQA-R: A Benchmark for Multi-Chart Question Answering in Real-World Reasoning Scenarios

18 Sept 2025 (modified: 16 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: datasets and benchmarks, multi-chart question answering, cross-chart reasoning, chart understanding and reasoning
Abstract: Existing benchmarks for chart analysis focus primarily on single-chart tasks, while multi-chart benchmarks are scarce and limited to simplistic question types, making it difficult to comprehensively evaluate the reasoning and decision-making capabilities of multimodal large language models (MLLMs) in realistic scenarios. We present MultiChartQA-R, a benchmark for evaluating multi-chart question answering, ranging from fundamental abilities to decision-making applications, through four progressively complex reasoning tasks drawn from real-world scenarios: cross-chart trend comparison, complementary data integration, anomaly and causal analysis, and strategy recommendation. The benchmark is available in three major languages, each version containing 695 chart–code pairs and 2,160 QA pairs, and is extensible to additional languages. We further propose a flexible multiple-choice evaluation metric that can be adjusted to different real-world scenarios, along with an extended dataset of 512 charts and 1,212 QA pairs designed to study retrieval and scaling behavior as the number of charts increases. Our evaluation of 13 representative MLLMs (4 proprietary and 9 open-weight) reveals significant performance gaps relative to humans, especially in cross-chart visual perception, data integration, and alignment with human preferences. Our experiments additionally reveal interesting multilingual characteristics of multi-chart question answering.
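The abstract does not specify the exact form of the flexible multiple-choice metric. As a rough illustration only, the sketch below shows one way a scenario-adjustable multiple-choice scorer could work: partial credit per correct option selected, with a tunable penalty for incorrect selections. The function name `flexible_mc_score` and the `wrong_penalty` parameter are hypothetical, not taken from the paper.

```python
from typing import Set


def flexible_mc_score(selected: Set[str], gold: Set[str],
                      wrong_penalty: float = 0.5) -> float:
    """Score a multi-answer multiple-choice response with partial credit.

    Each correct option selected earns an equal share of full credit;
    each incorrect selection subtracts `wrong_penalty` times that share.
    The penalty can be raised for high-stakes scenarios (e.g., strategy
    recommendation) or lowered for exploratory ones. Scores are clipped
    to [0, 1].
    """
    if not gold:
        raise ValueError("gold answer set must be non-empty")
    per_option = 1.0 / len(gold)          # credit share per gold option
    hits = len(selected & gold)           # correct selections
    misses = len(selected - gold)         # incorrect selections
    score = hits * per_option - misses * per_option * wrong_penalty
    return max(0.0, min(1.0, score))


# Example: two of three correct options chosen, plus one wrong pick.
print(flexible_mc_score({"A", "B", "D"}, {"A", "B", "C"}))  # ≈ 0.5 under the default penalty
```

A stricter scenario would set `wrong_penalty=1.0` so that each wrong pick cancels a correct one; `wrong_penalty=0.0` recovers plain recall over the gold options.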
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 10006