Benchmarking Complex Chart Reasoning via Sub-question Decomposition and Variant-based Robustness Analysis
Keywords: Multimodal Large Language Models, Benchmark, Chart Understanding, Chart Reasoning
Abstract: Multimodal Large Language Models (MLLMs) demonstrate strong potential for chart interpretation, yet existing benchmarks mainly assess final-answer accuracy, neglecting intermediate reasoning validity and robustness to visual perturbations. We present CHART-FGR, a fine-grained benchmark that decomposes each complex question into interpretable sub-questions and tests models under five visual perturbations (blur, noise, watermark, label removal, color distortion). Spanning 20 chart types, 200 base charts yield 1,652 sub-questions and 8,260 QA pairs across 1,000 images. Evaluations of leading MLLMs show significant performance drops (18–42%) and reveal that most failures stem from early decomposition or perception errors. These findings highlight the necessity of process-oriented diagnostics to ensure trustworthy deployment in real-world, low-quality visual environments.
Code is available at https://anonymous.4open.science/r/ChartSQA-DACC/.
Primary Area: datasets and benchmarks
Submission Number: 16937