Benchmarking Complex Chart Reasoning via Sub-question Decomposition and Variant-based Robustness Analysis
Keywords: Multimodal Large Language Models, Benchmark, Chart Understanding, Chart Reasoning
Abstract: Multimodal Large Language Models (MLLMs) demonstrate strong potential for chart interpretation, yet existing benchmarks mainly assess final-answer accuracy, neglecting intermediate reasoning validity and robustness to visual perturbations. We present CHART-FGR, a fine-grained benchmark that decomposes each complex question into interpretable sub-questions and tests models under five visual perturbations (blur, noise, watermark, label removal, color distortion). Spanning 20 chart types, 200 base charts yield 1,652 sub-questions and 8,260 QA pairs across 1,000 images. Evaluations of leading MLLMs show significant performance drops (18–42%) and reveal that most failures stem from early decomposition or perception errors. These findings highlight the necessity of process-oriented diagnostics to ensure trustworthy deployment in real-world, low-quality visual environments.
Code is available at https://anonymous.4open.science/r/ChartSQA-DACC/.
Primary Area: datasets and benchmarks
Submission Number: 16937