Abstract: Chart Question Answering requires integrating visual understanding with complex reasoning, yet current multimodal large language models (MLLMs) struggle to bridge this modality gap, failing to transfer their robust text-based reasoning capabilities to chart analysis. To address this challenge, we introduce ChartReasoner, a code-driven framework that transforms charts into symbolic ECharts code representations, enabling text-based reasoning mechanisms to operate directly on structured chart data. Our approach combines three key innovations: Chart2Code, an MLLM trained on 110K diverse charts that generates accurate, executable chart code; ChartReasoning, a dataset of 140K samples with explicit reasoning chains spanning mathematical, visual, fact-checking, and data-retrieval operations; and a two-stage training methodology that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to enhance reasoning consistency. Experimental results demonstrate that our model matches state-of-the-art open-source models on standard benchmarks such as ChartQA and ChartBench while using fewer parameters, and competes with proprietary models like GPT-4o even in challenging out-of-domain scenarios. Our work shows that symbolic code representations provide an effective bridge between the visual and textual modalities, enabling more accurate and generalizable reasoning for chart understanding tasks.
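To make the symbolic representation concrete, the sketch below (in TypeScript, since ECharts options are JavaScript objects) shows the kind of ECharts code a Chart2Code-style model might emit, and how a chart question then reduces to text-based reasoning over structured fields. This is a minimal illustration under assumed inputs: the chart, its values, and the retrieval query are invented for exposition, not outputs from the paper.

```typescript
// A minimal sketch of the symbolic chart representation: a plain ECharts
// "option" object (the format echarts.init(...).setOption() consumes).
// All titles and data values here are illustrative assumptions.
const option = {
  title: { text: "Quarterly Revenue (illustrative)" },
  xAxis: { type: "category", data: ["Q1", "Q2", "Q3", "Q4"] },
  yAxis: { type: "value", name: "Revenue ($M)" },
  series: [{ type: "bar", name: "Revenue", data: [12.4, 15.1, 9.8, 18.3] }],
};

// Once the chart is code, a data-retrieval question ("Which quarter had the
// highest revenue?") becomes reasoning over structured fields rather than
// pixel-level perception.
const categories = option.xAxis.data;
const values = option.series[0].data;
const best = values.indexOf(Math.max(...values));
console.log(`${categories[best]} has the highest revenue: ${values[best]}M`);
```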
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: reasoning, multimodal QA, reinforcement learning, multimodality
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 3907