# AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent

## I. Introduction

Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in natural language reasoning with long chain-of-thought. However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations. In this work, we present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision to efficiently tackle complex mathematical problems. 

Our approach introduces three key innovations: 
- An automated method that converts natural language chain-of-thought into structured tool-augmented trajectories, generating high-quality supervised fine-tuning (SFT) data to alleviate data scarcity; 
- A novel agentic reinforcement learning (RL) paradigm that dynamically interleaves natural language generation with real-time code execution. This enables models to autonomously learn optimal tool-use strategies through multi-round interactive feedback, while fostering emergent capabilities in code refinement and error correction; 
- An efficient training system incorporating innovative techniques, including request-level asynchronous rollout scheduling, agentic partial rollout, and prefix-aware weighted load balancing, achieving 4-5× speedup and making efficient RL training feasible on ultra-long sequences with scenarios with massive tool calls. 

Extensive evaluations show that AgentMath achieves state-of-the-art performance on challenging mathematical competition benchmarks including AIME24, AIME25, and HMMT25, substantially outperforming frontier open‑source models of comparable size. Specifically, AgentMath-30B-A3B attains 90.6%, 86.4%, and 73.8% accuracy respectively, surpassing OpenAI-o3-mini and Claude-Opus-4.0-Thinking while remaining competitive with OpenAI-o3, Gemini-2.5-Pro, and DeepSeek-R1-671B-0528. These results validate the effectiveness of our approach and pave the way for building more sophisticated and scalable mathematical reasoning agents.

---

## II. Getting Started with AgentMath Training

### 1. Environment Setup

#### 1.1 AgentMath Environment Installation

```bash
conda create -n agentmath python==3.11 -y
conda activate agentmath
cd /Your_code_path/AgentMath
pip install Wikipedia-API PyPDF2 vllm==0.10.1
pip install transformers==4.56.1 accelerate==1.9.0 click==8.2.1 tiktoken==0.11.0 pyarrow==20.0.0 datasets==4.0.0
pip install flash-attn --no-build-isolation
git clone https://github.com/volcengine/verl.git
cd verl
pip install -e .
pip install liger-kernel math_verify json5 polars debugpy jsonlines sandbox-fusion mathruler
pip install -U "ray[default]"
```

#### 1.2 Sandbox Environment Installation

```bash
conda create -n sandbox-runtime python==3.11 -y
conda activate sandbox-runtime
git clone https://github.com/bytedance/SandboxFusion.git
cd SandboxFusion
pip install scs==3.2.7 h5py==3.14.0
pip install -r runtime/python/requirements.txt 
pip install torch torchvision
pip install flash-attn --no-build-isolation
pip install poetry 
pip install sandbox-fusion
poetry install  
mkdir -p docs/build
```

#### 1.3 Starting the Sandbox Server

```bash
cd SandboxFusion
conda activate sandbox-runtime && make run-online
```

**Note**: For multi-node setups, install and start the Sandbox service on each machine, running it in the background.

---

### 2. Tool-Augmented Data Synthesis Pipeline

#### 2.1 Download Dataset

Our paper primarily uses open-source data such as `a-m-team/AM-DeepSeek-R1-0528-Distilled`. First, download this dataset from Hugging Face:

```bash
git clone https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-0528-Distilled
```

#### 2.2 Generate Tool-Augmented Prompts

Use our tool-augmented synthesis prompts, which focus on replacing complex and error-prone calculations with tool calls:

```bash
python agentmath/data_synthesis/synthesis_prompt.py
```

#### 2.3 Synthesize Data with DeepSeek-V3

Since we deploy DeepSeek-V3-0324 for data synthesis based on company internal infrastructure, it's necessary to download the DeepSeek-V3-0324 model weights. Then, set up a service using frameworks like vLLM or SGLang for data synthesis, for example:
```bash
vllm serve deepseek-ai/DeepSeek-V3-0324
```

#### 2.4 Refine Synthesized Data

Apply multi-dimensional quality refinement to the synthesized data:

```bash
python agentmath/data_synthesis/process_synthesis_data.py
```

#### 2.5 Supervised Fine-Tuning

After completing the above steps, use the LlamaFactory framework to perform SFT training and obtain the AgentMath SFT model.

---

### 3. Agent Reinforcement Learning Training

#### 3.1 Prepare RL Training Data

We collect challenging problems mainly from multiple high-quality open-source RL datasets: **DeepScaler**, **Skywork-OR1**, **DAPO**, and **POLARIS**. 

The following example demonstrates data format conversion and RL training using the DAPO 17k dataset. The same procedure applies to other datasets.

**Preprocess DAPO 17k data:**

```bash
python agentmath/data_synthesis/agentmath_rl_data_process.py
```

#### 3.2 Initialize Ray Cluster

**On the head node:**

```bash
cd /root/Your_AgentMath_Path
conda activate agentmath
ray start --head --node-ip-address Your_Main_Node_IP --num-gpus 8
```

**On worker nodes:**

```bash
EXP_IP_LIST="IP1,IP2,IP3,..."
pdsh -w $EXP_IP_LIST "conda activate agentmath; ray start --address Your_Main_Node_IP:6379 --num-gpus 8"
```

#### 3.3 Three-Stage RL Training

**Configuration Notes:**
- The following examples use Qwen3-8B-Base. For other model sizes (Qwen3-1.7B-Base, Qwen3-30B-A3B-Base, etc.), follow the same procedure.
- Configure `IP_Address_strings` with your node IPs in the format: `your_node_ip1,your_node_ip2,your_node_ip3,...`
- Ensure the Sandbox service is running on all nodes. We recommend using `tmux` for process management.

---

**Stage 1: AgentMath 48k Maximum Output Length, 48 Tool Calls**

```bash
ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"env_vars": {"IP_Address_strings": "your_node_ip1,your_node_ip2,your_node_ip3,...", "TOKENIZERS_PARALLELISM": "true", "NCCL_DEBUG": "WARN", "VLLM_LOGGING_LEVEL": "WARN", "MKL_SERVICE_FORCE_INTEL": "1", "NCCL_SOCKET_IFNAME": "bond1", "VLLM_USE_V1": "1"}}' \
  -- bash agentmath/train_scripts/run_agentmath_qwen3_8b_train_48k_length_48tools.sh /root/your_model_path/agent_math_SFT_ckpt_path
```

**Stage 2: AgentMath 72k Maximum Output Length, 72 Tool Calls**

```bash
ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"env_vars": {"IP_Address_strings": "your_node_ip1,your_node_ip2,your_node_ip3,...", "TOKENIZERS_PARALLELISM": "true", "NCCL_DEBUG": "WARN", "VLLM_LOGGING_LEVEL": "WARN", "MKL_SERVICE_FORCE_INTEL": "1", "NCCL_SOCKET_IFNAME": "bond1", "VLLM_USE_V1": "1"}}' \
  -- bash agentmath/train_scripts/run_agentmath_qwen3_8b_train_72k_length_72tools.sh /root/your_model_path/agent_math_SFT_ckpt_path
```

**Stage 3: AgentMath 96k Maximum Output Length, 96 Tool Calls**

```bash
ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"env_vars": {"IP_Address_strings": "your_node_ip1,your_node_ip2,your_node_ip3,...", "TOKENIZERS_PARALLELISM": "true", "NCCL_DEBUG": "WARN", "VLLM_LOGGING_LEVEL": "WARN", "MKL_SERVICE_FORCE_INTEL": "1", "NCCL_SOCKET_IFNAME": "bond1", "VLLM_USE_V1": "1"}}' \
  -- bash agentmath/train_scripts/run_agentmath_qwen3_8b_train_96k_length_96tools.sh /root/your_model_path/agent_math_SFT_ckpt_path
```

---

### 4. Model Evaluation

Evaluate AgentMath on **AIME24**, **AIME25**, and **HMMT25** benchmarks. We run 32 independent trials and report avg@32 as the pass@1 metric.

```bash
conda activate agentmath
ray start --head
cd /root/Your_AgentMath_Path

ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"env_vars": {"IP_Address_strings": "your_node_ip1,your_node_ip2,your_node_ip3,...", "TOKENIZERS_PARALLELISM": "true", "NCCL_DEBUG": "WARN", "VLLM_LOGGING_LEVEL": "WARN", "MKL_SERVICE_FORCE_INTEL": "1", "NCCL_SOCKET_IFNAME": "bond1", "VLLM_USE_V1": "1"}}' \
  -- bash agentmath/eval_scripts/run_agentmath_eval.sh /root/your_model_path/agent_math_ckpt_path rl_math
```


---

## III. Codebase Overview

The core functionality of AgentMath is implemented in this repository:

- **Data Synthesis Pipeline**: Tool-augmented data synthesis code located in `agentmath/data_synthesis`
- **RL Training Modules**: Efficiency optimization components including Request-Level Asynchronous Rollout Scheduling, Agentic Partial Rollout, and Prefix-Aware Weighted Load Balancing in `agentmath/agent_dapo_ray_trainer_patial_rollout.py` and `agentmath/agent_loop`
- **Running Scripts**: Training scripts in `agentmath/train_scripts`, evaluation scripts in `agentmath/eval_scripts`

**Note**: Due to time constraints,  this codebase may use some hard-coded elements (including the synthesis process, RL training, and data paths), and some code parts may be missing. If you encounter any issues during execution, please provide feedback actively. We sincerely apologize for this and kindly ask for your understanding. We commit to continuously optimizing this repository, improving code quality, and making the training process more robust, smoother, and stably runnable.

---
## IV. Notes and Future Plans

**Regarding the synthesized data and model checkpoints, their release is subject to our company's open-source policy review process to ensure compliance. We are actively advancing the company review process.**

To facilitate researchers in reproducing our research results, we open-source the main code related to the AgentMath method here, including relevant prompts, training hyperparameters, data synthesis pipeline, and detailed implementation workflow to improve the transparency and effectiveness of our work. Once we receive company approval, we will immediately release the synthesized data and models to support further research in the Agent LLM community.

We sincerely apologize for any inconvenience caused and welcome you to continue following our progress. We look forward to discussing the specific implementation details of our paper with you and apologize again for any inconvenience.

---

