# Official Code for "Do MLLMs Really Understand the Charts?"

This repository contains the official code for the paper "Do MLLMs Really Understand the Charts?". Our work investigates whether Multimodal Large Language Models (MLLMs) possess true visual reasoning capabilities for chart comprehension, or merely rely on superficial pattern recognition and OCR.

To address this, we introduce:

1.  **ChartVRBench**: A new benchmark specifically designed to evaluate visual reasoning on non-annotated charts, where value estimation requires interpreting scales and positions rather than reading text labels.
2.  **ChartVR**: A novel MLLM trained with a two-stage **Reinforcement Finetuning (RFT)** strategy to mimic human-like reasoning. This approach significantly enhances a model's ability to "read" a chart's structure, moving beyond simply "seeing" its components.

This repository provides all the necessary tools to reproduce our findings, generate new chart data, train custom ChartVR models, and evaluate performance on ChartVRBench.

## 📁 Repository Structure

The project is organized into four main directories, each corresponding to a key stage of our research pipeline:

  - **/make\_data/**: Contains the end-to-end pipeline for programmatically generating the synthetic portion of the **ChartVRBench** dataset.
  - **/chart\_cot/**: Scripts for generating the Chain-of-Thought (CoT) reasoning data required for training ChartVR.
  - **/train/**: The complete training infrastructure for fine-tuning models using our proposed Supervised Fine-Tuning (SFT) and Reinforcement Finetuning (RFT) strategies.
  - **/test/**: A comprehensive evaluation suite for running inference and scoring model performance on our proposed ChartVRBench.

-----

## 🚀 Key Workflows and Scripts

This section details the core components of our framework and how they map to the paper's methodology.

### 1\. Generating the ChartVRBench Synthetic Dataset (`/make_data/`)

This module implements the automated data generation pipeline described in the paper, creating non-annotated charts from executable code.

  - **`qwen_gen_code_mp.py`**: The core script for generating chart plotting code. It prompts an MLLM with Self-Instruct and Evol-Instruct prompts (stored in `/prompt/`) to build a diverse library of Python plotting scripts.
  - **`execute_code.py`**: Executes the generated Python scripts in a sandboxed environment to render high-quality, non-annotated chart images. It includes an AI-driven self-repair mechanism.
  - **`qwen_gen_qa.py`**: Generates question-answer pairs by parsing the ground-truth data directly from the executable code, ensuring verifiably correct Q\&A pairs.
  - **`batch_filter_image_qwen.py` & `batch_filter_qa.py`**: Quality control scripts that use a model-based judge to score the visual fidelity of images and the logical consistency of Q\&A pairs.
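
The execute-and-repair step can be pictured with the minimal sketch below. It is not the code in `execute_code.py`: the function names, the timeout, and the `repair_fn` hook (standing in for the model call that rewrites a failing script) are all hypothetical. Note also that a child process with a timeout only gives process isolation, not a real sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_chart_script(code: str, timeout_s: float = 30.0) -> tuple[bool, str]:
    """Run generated plotting code in a separate Python process.

    Returns (succeeded, stderr_text). Hypothetical helper for illustration.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "chart.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,            # any files the script saves land here
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout_s} s"
        return proc.returncode == 0, proc.stderr


def execute_with_repair(code, repair_fn, max_attempts=3):
    """Retry execution, handing the failing code and its error to repair_fn.

    repair_fn(code, error_text) -> new_code is an assumed hook, e.g. an LLM call.
    Returns the first version that runs cleanly, or None if all attempts fail.
    """
    for _ in range(max_attempts):
        ok, err = run_chart_script(code)
        if ok:
            return code
        code = repair_fn(code, err)
    return None
```

Passing the captured traceback to the repair hook gives the model the same information a human would use to debug the script; capping the number of attempts keeps a persistently broken script from stalling the pipeline.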

### 2\. Preparing Training Data for ChartVR (`/chart_cot/`)

This directory contains scripts to generate the detailed reasoning chains that are crucial for our two-stage training process.

  - **`make_thought_LLM.py`**: The primary script for generating high-quality Chain-of-Thought (CoT) data by distilling a step-by-step reasoning process from a teacher model.

### 3\. Training ChartVR (`/train/`)

This module contains the shell scripts to execute the two-stage **Reinforcement Finetuning (RFT)** strategy.

  - **Stage 1: Supervised Fine-Tuning (SFT)**

      - **Scripts**: `sft_qwen_3b.sh`, `sft_qwen_7b.sh`
      - **Purpose**: These scripts perform Supervised Fine-Tuning on the CoT data, teaching the model the fundamental structure of chart analysis.

  - **Stage 2: Group Relative Policy Optimization (GRPO)**

      - **Scripts**: `grpo_3b.sh`, `grpo_7b.sh`
      - **Purpose**: These scripts implement Group Relative Policy Optimization (GRPO), starting from the SFT checkpoint and improving the model's accuracy under our custom reward function.
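
For readers unfamiliar with GRPO, the sketch below shows its central idea: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group, so no separate value network is needed. This is a generic illustration rather than the training code behind `grpo_3b.sh` or `grpo_7b.sh`; the function name, the `eps` constant, and the use of the population standard deviation are assumptions, and the reward values would come from the custom reward function, which is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sample group.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that score above their group's average receive a positive advantage and are reinforced; a group in which every response earns the same reward yields all-zero advantages and therefore no learning signal.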

### 4\. Evaluating Models on ChartVRBench (`/test/`)

This module provides the tools to reproduce our evaluation results on ChartVRBench.

  - **`evaluate.py`**: The core evaluation script. It calculates the final accuracy scores based on the **relaxed accuracy metric** defined in the paper (2% relative error margin).
  - **Model-specific generators (`*.generate.py`)**: Specialized inference scripts used for evaluating various open-source and proprietary models on ChartVRBench.
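
As a rough illustration of the relaxed accuracy metric, the sketch below counts a numeric prediction as correct when its relative error against the ground truth is at most 2%. It is not the code in `evaluate.py`: the function names, the inclusive boundary, and the handling of a zero-valued target are assumptions, and answer parsing is out of scope.

```python
def is_relaxed_correct(pred: float, target: float, tol: float = 0.02) -> bool:
    """Return True if pred is within a relative error of tol of target."""
    if target == 0:
        # Assumption: relative error is undefined at zero, so require an exact match.
        return pred == 0
    return abs(pred - target) <= tol * abs(target)


def relaxed_accuracy(preds, targets, tol=0.02):
    """Fraction of predictions that fall within the relative tolerance."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(is_relaxed_correct(p, t, tol) for p, t in zip(preds, targets))
    return hits / len(preds)
```

For example, with a ground truth of 100, a prediction of 101 is accepted and a prediction of 103 is rejected.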

### 5\. Evaluation on Public Benchmarks

To ensure a fair and standardized comparison, our evaluation of ChartVR on public benchmarks (e.g., CharXiv, ChartBench, ChartQAPro) strictly follows the official protocols, configurations, and evaluation scripts provided by the authors of each respective benchmark. As such, these benchmark-specific evaluation scripts are not included in this repository. We refer users to the official repositories of those benchmarks for their evaluation pipelines.

-----

## ⚠️ A Note on Anonymization and Reproducibility

Please be aware that this code is provided as supplementary material for a paper under anonymous review.

  - **Purpose**: The primary goal of this repository is to demonstrate the key workflows of our proposed methodology and the training pipeline for ChartVR.
  - **Anonymization**: Some parts of the code and file paths may have been anonymized to protect the identity of the authors.
  - **Runnability**: Because of the anonymization process and differences between computational environments, the code may not run out of the box. Some environment settings, dependencies, or specific paths may be missing.

We are committed to full reproducibility. A complete, cleaned, and executable version of this repository will be made publicly available upon the publication of our paper.