<div align="center">

# MARS: Optimizing Dual-System Deep Research via Multi-Agent Reinforcement Learning

</div>

This repo contains all codes and data of our paper titled "Optimizing Dual-System Deep Research via Multi-Agent Reinforcement Learning"

</div>

## 🎯 Data Details
We provide:
- RL training data (5,050 samples)
- Curated data (40K samples)
- Test data (HLE, single-hop QA, multi-hop QA)


```bash
./data/
├── curated_data
│   └── curated_data.jsonl
├── rl_train_data
│   └── rl_train.jsonl
└── test_data
    ├── hle.jsonl
    ├── mhqa.jsonl  # multi-hop qa
    └── shqa.jsonl  # single-hop qa
```

## 🐝 Step 1: Deploy LLM service (judge model)

### Step 1.1: Python Environment
```bash
conda create -n sg python=3.11
conda activate sg
cd ./llm_server
pip install -r llm_server_requirements.txt
```

### Step 1.2: Download LLM from Huggingface
Please download the following models from huggingface:
```
Qwen2.5-7B-Instruct
Qwen2.5-72B-Instruct
```

### Step 1.3: Deploy the LLM server
support `sglang_router` for multi-node deployment
```bash
# On Master node
cd ./llm_server
bash main_router.sh

# On Worker nodes
bash worker.sh  # change ip in worker.sh
```

## 🐝 Step 2: Deploy Sandbox
You can deploy the sandbox on **another docker** for safety.

### Step 2.1: Python Environment
```bash
git clone https://github.com/bytedance/SandboxFusion.git
conda create -n sandbox -y python=3.11  # don't use python=3.12 or 3.10
conda activate sandbox
poetry install
```
### Step 2.2: Launch the sandbox server
```bash
# to build the real docs, run `cd docs && npm ci && npm run build`
mkdir -p docs/build
make run-online
```

## 🐝 Step 3: RL Training

### Step 3.1: Python Environment
You can refer to the `requirements.txt` for the specific version of some packages but follow these steps to build your python env.
```bash
conda create -n mars python=3.10
conda activate mars

cd ./verl-search
pip install -e .

cd ./qwen-agent
pip install -e .

pip install vllm==0.8.5

pip install flash-attn --no-build-isolation
```

### Step 3.2: RL-zero Training
Before training, you should change the key and IP of (LLM Server in Step 1, Sandbox of Step 2, google search and google scholar in `https://serpapi.com/`) in `rl_train.sh`

```bash
cd ./rl_scripts
bash rl_train.sh
```

## ❤️ Acknowledgements

This work is built upon several excellent open-source projects. We sincerely thank:

- [VeRL](https://github.com/volcengine/verl) for providing the RL framework
- [vLLM](https://github.com/vllm-project/vllm) for the efficient inference engine in rollout with high throughput
- [SGLang](https://github.com/sgl-project/sglang) for the efficient inference engine in LLM Server with high throughput
- [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) for rollout process
- [Sandbox](https://github.com/bytedance/SandboxFusion) for safe sandbox

We express our gratitude to all these projects for their outstanding contributions to the open-source community.


We believe you can easily reproduce our work by following these steps ❤️.