Keywords: Chart Question-Answering, Benchmark, Multimodal Large Language Models
TL;DR: A benchmark for evaluating MLLMs' cross-modal reasoning capability in multi-chart, context-rich document scenarios.
Abstract: While Multimodal Large Language Models (MLLMs) have achieved remarkable success on single-chart question answering tasks, reaching over 90% accuracy on benchmarks such as PlotQA, this apparent success masks a critical limitation: current models perform poorly on complex, multi-chart reasoning tasks that mirror real-world analytical scenarios. In professional document analysis, users typically integrate information across multiple visualizations within rich contextual frameworks rather than examining isolated charts, a capability that remains largely unexplored in existing evaluations. To bridge this gap, we introduce ChartNexus, a novel and challenging benchmark specifically designed to assess the multi-chart reasoning capabilities of MLLMs in authentic document contexts. ChartNexus comprises 1,370 carefully curated question-answering pairs derived from 6,793 real-world charts spanning 18 domains, including scientific papers, government reports, and industry analyses. Each question demands complex reasoning skills, such as comparative analysis, sequential information integration, and cross-modal synthesis between visual and textual elements. To systematically evaluate these capabilities, we design a taxonomy with 4 high-level difficulty categories and 11 fine-grained sub-categories. Our evaluation of 23 state-of-the-art MLLMs reveals significant performance degradation relative to single-chart benchmarks: while the best commercial model achieves over 90% accuracy on simpler tasks, its performance drops by more than half on ChartNexus. Through systematic failure analysis, we identify critical weaknesses in current models' ability to maintain working memory across multiple charts, perform cross-modal reasoning, and integrate contextual information effectively. ChartNexus establishes a new frontier for evaluating complex chart understanding, demonstrating that robust multi-chart reasoning remains an open challenge. Our benchmark and comprehensive analysis provide the research community with essential diagnostic tools to advance the development of more capable and practically useful MLLMs for real-world document analysis scenarios.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 18767