Keywords: chart understanding, multimodal reasoning, curriculum learning, scalable chart understanding benchmark, large multimodal language models
TL;DR: We propose a curriculum learning framework with tailored multi-level datasets that guide MLLMs to reason with dynamic visual grounding, thereby enhancing their chart understanding and multimodal reasoning capabilities.
Abstract: Chart question answering (CQA) requires multimodal large language models (MLLMs) to integrate visual comprehension with logical reasoning, yet current models struggle with accurate visual grounding and coherent reasoning chains.
Although external chain-of-thought prompting and visual cues can substantially improve performance, current MLLMs lack intrinsic visually grounded reasoning, which leads to inaccurate perception and reasoning chains disconnected from visual evidence.
To address these limitations, we propose CURV, a curriculum learning framework that develops intrinsic visually grounded reasoning by reformulating CQA as multi-turn visual reasoning, in which each step coordinates logical reasoning with dynamic visual grounding through spatial attention concentration.
To support this training, we further introduce CCQA, a three-level curriculum dataset built through scalable synthetic generation across diverse chart types and reasoning patterns. The curriculum progresses systematically from basic single-operation reasoning to complex multi-chart compositional tasks.
Experiments demonstrate that CURV achieves accuracy improvements of up to 10.79% over baselines and generalizes strongly to real-world benchmarks and out-of-domain multimodal reasoning tasks, validating the effectiveness of internalizing visual reasoning with dynamic grounding for enhanced chart understanding.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 22768