# Visual Programmability: Code-as-Thought for Chart Understanding

<p align="center">
  <img src="figure/teaser.png" alt="Code-as-Thought Framework" width="800"/>
</p>

Chart understanding is a demanding test of the reasoning capabilities of Vision-Language Models (VLMs). Prior approaches have clear limitations: some rely on external tools, making them brittle and constrained by a predefined toolkit, while others fine-tune specialist models that often adopt a single reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate steps of text-based reasoning are difficult to verify, which complicates the use of reinforcement-learning signals that reward factual accuracy. To address this, we propose a Code-as-Thought (CaT) approach that represents the visual information of a chart in a verifiable, symbolic format. Our key insight is that this strategy must be adaptive: a fixed, code-only implementation consistently fails on complex charts where a symbolic representation is unsuitable. This finding leads us to introduce Visual Programmability: a learnable property that determines whether a chart-question pair is better solved with code or with direct visual analysis. We implement this concept in an adaptive framework in which a VLM learns to choose between the CaT pathway and a direct visual reasoning pathway. The model's selection policy is trained with reinforcement learning using a novel dual-reward system. This system combines a data-accuracy reward, which grounds the model in facts and prevents numerical hallucination, with a decision reward, which teaches the model when to use each strategy and prevents it from defaulting to a single reasoning mode. Experiments demonstrate strong and robust performance across diverse chart-understanding benchmarks. Our work shows that VLMs can be taught not only to reason but also how to reason, dynamically selecting the optimal reasoning pathway for each task.

---

## Environment Setup
Create and activate a clean conda environment, then install the required dependencies:

```bash
conda create -n cat python=3.10 -y
conda activate cat
pip install -r requirements.txt
```

---

## Dataset Preparation
Datasets should be in Hugging Face Parquet format with the following required fields:
- `images`: list of images as bytes dictionaries, e.g. `[{"bytes": ...}]`
- `prompt`: text prompt (include `<image>` token when an image is present)
- `ground_truth`: target answer string; some reward functions expect specific tags, such as `<answer>...</answer>`, `<csv>...</csv>`, or `<programability>yes|no</programability>`
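
As a rough illustration of the fields listed above, the following sketch assembles one record and checks it against that schema. The helper names (`make_record`, `validate_record`), the placeholder image bytes, and the example question are hypothetical and not part of this repository; treat it as a sanity check under those assumptions, not as the project's loader.

```python
REQUIRED_FIELDS = ("images", "prompt", "ground_truth")


def make_record(image_bytes: bytes, question: str, answer: str) -> dict:
    """Build one dataset row (hypothetical helper, not from this repository)."""
    return {
        # Each image is stored as a dictionary holding its raw bytes.
        "images": [{"bytes": image_bytes}],
        # The prompt carries an <image> token because an image is attached.
        "prompt": f"<image>{question}",
        # Plain answer string; some reward functions may expect tagged targets.
        "ground_truth": answer,
    }


def validate_record(record: dict) -> list:
    """Return a list of problems found in a row; an empty list means none found."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if problems:
        return problems

    images = record["images"]
    if not isinstance(images, list) or not all(
        isinstance(img, dict) and isinstance(img.get("bytes"), bytes) for img in images
    ):
        problems.append("images must be a list of {'bytes': ...} dictionaries")

    prompt = record["prompt"]
    if not isinstance(prompt, str):
        problems.append("prompt must be a string")
    elif images and "<image>" not in prompt:
        problems.append("prompt has images but no <image> token")

    if not isinstance(record["ground_truth"], str):
        problems.append("ground_truth must be a string")
    return problems


# Placeholder bytes stand in for a real encoded chart image.
record = make_record(b"\x89PNG placeholder", "What is the highest bar?", "42")
print(validate_record(record))
```

Rows of this shape could then be written with, for example, `pandas.DataFrame([record]).to_parquet(...)`, which requires `pandas` and a Parquet engine such as `pyarrow` to be installed.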

We provide conversion scripts in `my_dataset/` for popular chart-understanding datasets (ChartBench, ChartQA, and CharXiv). Edit the constants in each script to point to your local raw data directory, then run the script to generate `benchmark_*.parquet` files.

---

## Training
To train the model, configure and run the provided training script:

```bash
bash examples/qwen2_5vl_7b.sh
```

**Important Configuration:**
- Set these variables in the script according to your setup: `MODEL_PATH`, `TRAIN_DATA`, `VAL_DATA`, `EXPERIMENT_NAME`, `FORMAT_PROMPT`, `REWARD_FUNCTION`, `NUM_GPUS`, and optionally `TENSORBOARD_DIR`.
- The script runs `python -m verl.trainer.main` with the decision prompt and decision reward by default. Modify the parameters as needed for your setup.
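
To give a feel for what a decision-style reward might compute, here is a minimal sketch that parses the tags mentioned under Dataset Preparation and combines two binary scores. It is not the reward function shipped in `examples/reward_function/`: the function names, the exact-match scoring, and the `decision_weight` value are all assumptions made for illustration, and the data-accuracy term on `<csv>` content is left out for brevity.

```python
import re


def extract_tag(text: str, tag: str):
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def decision_reward(response: str, ground_truth: str) -> float:
    """1.0 if the predicted yes/no label matches the target label, else 0.0."""
    predicted = extract_tag(response, "programability")
    target = extract_tag(ground_truth, "programability")
    if predicted is None or target is None:
        return 0.0
    return float(predicted.lower() == target.lower())


def answer_reward(response: str, ground_truth: str) -> float:
    """1.0 on an exact string match of the <answer> content (a simplification)."""
    predicted = extract_tag(response, "answer")
    target = extract_tag(ground_truth, "answer")
    return float(predicted is not None and predicted == target)


def combined_reward(response: str, ground_truth: str, decision_weight: float = 0.5) -> float:
    """Weighted sum of both terms; the default weight is an arbitrary example value."""
    return (
        decision_weight * decision_reward(response, ground_truth)
        + (1.0 - decision_weight) * answer_reward(response, ground_truth)
    )


# Hypothetical model output and target, both using the tagged format.
target = "<programability>yes</programability><answer>2</answer>"
good = "<programability>yes</programability><csv>a,b\n1,2</csv><answer>2</answer>"
wrong_mode = "<programability>no</programability><answer>2</answer>"

print(combined_reward(good, target))
print(combined_reward(wrong_mode, target))
```

Keeping the decision term separate from the answer term means a response can be penalized for choosing the wrong pathway even when its final answer happens to be correct.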

---

## Evaluation
To evaluate the trained model, configure and run the validation script:

```bash
bash examples/val_sh/val_chartbench.sh
```

**Configuration Requirements:**
- Set the following variables in the script: `MODEL_PATH`, `TRAIN_DATA`, `VAL_DATA`, `FORMAT_PROMPT`, `REWARD_FUNCTION`, `NUM_GPUS`, and `VAL_OUTPUT_FILE`.
- The script runs in validation-only mode (`trainer.val_only=true`) and outputs detailed generations and evaluation metrics.

---

## Repository Structure
- `examples/format_prompt/`: Jinja2 template prompts for code generation, chain-of-thought, and decision making
- `examples/reward_function/`: reward functions corresponding to different prompt templates
- `examples/config.yaml`: default training configuration
- `examples/qwen2_5vl_7b.sh`: example training script for the Qwen2.5-VL-7B model
- `examples/val_sh/val_chartbench.sh`: example validation script for ChartBench evaluation
- `my_dataset/`: data conversion scripts to transform raw datasets into Parquet format
- `scripts/model_merger.py`: utility to merge FSDP model shards and export Hugging Face compatible weights
- `verl/`: core training framework integrating Ray, FSDP, and vLLM
- `requirements.txt`: Python package dependencies

---

## Acknowledgements
- This work is built on the [EasyR1](https://github.com/hiyouga/EasyR1) training framework, which provides efficient and scalable RL training infrastructure.
- We gratefully acknowledge the open-source communities and contributors of [HuggingFace Transformers](https://github.com/huggingface/transformers), [vLLM](https://github.com/vllm-project/vllm), [Ray](https://github.com/ray-project/ray), [FlashAttention](https://github.com/Dao-AILab/flash-attention), and [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5) for making this research possible.

