<div align="center">
<h1 align="center">Scaling Generalist Data-Analytic Agents</h1>
</div>

## Table of Contents

- [🌟Overview](#overview)
- [🔧Installation](#installation)
- [🚀QuickStart](#quickstart)
- [🧐Evaluation](#evaluation)
---


![Overview of the DataMind method](./assets/method.png)

## 🌟Overview

Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering or multi-agent scaffolds built on proprietary models, while open-source models still struggle with the diverse-format, large-scale data files and the long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces **DataMind**, a scalable data synthesis and agent training recipe for building generalist data-analytic agents. **DataMind** tackles three key challenges in building open-source data-analytic agents: insufficient data resources, improper training strategies, and unstable code-based multi-turn rollout.

Concretely, **DataMind** applies:
- A fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 
- A knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 
- A dynamically adjustable training objective combining both SFT and RL losses;
- A memory-frugal and stable code-based multi-turn rollout framework. 
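
To make the third item more concrete, here is a minimal sketch of one common way to blend an SFT loss and an RL loss with a weight that changes over training. The cosine schedule, the convex-combination form, and the names `sft_weight` and `combined_loss` are illustrative assumptions, not necessarily the objective or schedule used by DataMind.

```python
import math


def sft_weight(step: int, total_steps: int, start: float = 1.0, end: float = 0.0) -> float:
    """Cosine-anneal the SFT weight from `start` to `end` (hypothetical schedule)."""
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def combined_loss(sft_loss: float, rl_loss: float, gamma: float) -> float:
    """Weighted sum of the two losses: gamma * SFT + (1 - gamma) * RL (assumed form)."""
    return gamma * sft_loss + (1.0 - gamma) * rl_loss


# Early in training the SFT term dominates; later the RL term takes over.
for step in (0, 50, 100):
    gamma = sft_weight(step, total_steps=100)
    print(step, round(gamma, 3), round(combined_loss(2.0, 4.0, gamma), 3))
```

Under this assumed schedule, training starts close to pure imitation of the curated trajectories and gradually shifts weight toward the reward-driven objective.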

Built on **DataMind**, we curate **DataMind-12K**, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves a state-of-the-art average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. In the analysis experiments, we also report empirical insights from our exploratory trials, aiming to give the community actionable guidance on agent training. We will release DataMind-12K, DataMind-7B, and DataMind-14B for the community's future research.


## 🔧Installation
### verl
```bash
cd verl
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install -e ".[vllm]"
pip install -e ".[sglang]"
```

### eval
```bash
cd eval
pip install -r requirements.txt
```

## 🚀QuickStart
### Cold Start
We provide our dataset `datamind_12k_sample` in `datamind_12k_sample.zip`. You can fine-tune the model with LLaMA-Factory. An example configuration:
```yaml
### model
model_name_or_path: Qwen/Qwen2.5-Coder-7B-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: datamind_12k_sample
template: qwen
cutoff_len: 8192
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: <your_output_dir>
logging_steps: 1
save_strategy: 'no'
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
```
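
The configuration above sets a per-device batch size and a gradient accumulation factor, so the effective global batch size depends on how many GPUs you train on. The sketch below works through that arithmetic; the GPU count of 8 is an assumption for illustration only, since the configuration does not specify it.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Number of sequences contributing to each optimizer step under data parallelism."""
    return per_device * grad_accum * num_gpus


# Values taken from the YAML above; num_gpus=8 is a hypothetical setup.
print(effective_batch_size(per_device=1, grad_accum=4, num_gpus=8))
```

If you train on fewer GPUs, you may want to raise `gradient_accumulation_steps` accordingly to keep the effective batch size comparable.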

### Reinforcement Learning
We use the verl framework for reinforcement learning. Our sampled data is provided in `verl/data/rl_sample.zip`.

Before training, update the paths in `agent/async_interpreter.py`, `verl/utils/reward_score/sql.py`, and `verl/workers/rollout/sglang_rollout/sglang_rollout.py` to match your environment.

Then update the information in the `multi.sh` script and run it to start training:
```bash
bash multi.sh
```

Due to file size constraints, the Database and CSV files are not included in the code repository.

## 🧐Evaluation
You can use `eval/model.sh` to launch the model.
### For Python Evaluation
First unzip `da-dev-tables.zip` and `tablebench_csv.zip`.
Then modify `eval/python/eval.sh` as needed and run it to start the Python evaluation.
```sh
PORT=19007
export OPENAI_BASE_URL=http://0.0.0.0:${PORT}/v1
export OPENAI_API_KEY=placeholder_key

python eval_python.py \
    --model datamind \
    --temperature 0.7 \
    --top_p 0.95 \
    --bs 5 \
    --test_bench dabench \
    --test_file test_file/daeval_test.parquet \
    --csv_or_db_folder da-dev-tables
```
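
Before starting a long evaluation run, it can help to confirm that the model server is reachable at `OPENAI_BASE_URL`. The following is a minimal standard-library sketch that assumes the server launched by `eval/model.sh` exposes the usual OpenAI-compatible `/v1/models` route; the helper names `models_url` and `list_served_models` are hypothetical and not part of this repository.

```python
from __future__ import annotations

import json
import urllib.error
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"


def list_served_models(base_url: str, api_key: str = "placeholder_key",
                       timeout: float = 5.0) -> list[str]:
    """Return the served model ids, or an empty list if the server is unreachable."""
    request = urllib.request.Request(
        models_url(base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON response.
        return []
    return [entry["id"] for entry in payload.get("data", [])]
```

For example, `list_served_models("http://0.0.0.0:19007/v1")` returns an empty list when nothing is listening on that port. If it returns model ids, the value passed to `--model` presumably needs to match one of them.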

### For SQL Evaluation
First unzip `eval/sql/test_file/bird_dev_csv_results.zip`. Then modify `eval/sql/eval.sh` as needed and run it to start the SQL evaluation.
```sh
PORT=19008
export OPENAI_BASE_URL=http://0.0.0.0:${PORT}/v1
export OPENAI_API_KEY=placeholder_key

python eval_bird.py \
    --model datamind \
    --temperature 0.7 \
    --top_p 0.95 \
    --bs 5 \
    --test_bench bird \
    --test_file bird/test_file/bird_dev.parquet \
    --csv_or_db_folder bird/dev_sqlite_files \
    --gold_csv_results_dir bird/bird_dev_csv_results \
    --db_schema_data_path bird/bird_dev_omni_ddl.json
```

Due to file size constraints, the Database and CSV files are not included in the code repository.
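
Because these files have to be obtained separately, a quick pre-flight check can catch missing inputs before the evaluation starts. The sketch below lists which of the paths passed to `eval_bird.py` in the command above are absent. The helper `missing_paths` is hypothetical, and the working directory from which the relative paths resolve is an assumption you should adjust to your layout.

```python
from pathlib import Path

# Paths copied from the eval_bird.py command above.
BIRD_INPUTS = [
    "bird/test_file/bird_dev.parquet",
    "bird/dev_sqlite_files",
    "bird/bird_dev_csv_results",
    "bird/bird_dev_omni_ddl.json",
]


def missing_paths(paths, root="."):
    """Return the entries of `paths` that do not exist under `root`, in order."""
    base = Path(root)
    return [p for p in paths if not (base / p).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the directory the relative paths refer to.
    for path in missing_paths(BIRD_INPUTS):
        print(f"missing: {path}")
```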