# IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction

This repository contains the code and data needed to reproduce the results of our paper.

### Step1: Experiment Environment
To ensure reproducibility, we build the training environment on top of the Docker image of the open-source Slime framework.

```bash
docker pull slimerl/slime:latest # This is an open-source environment and does not violate anonymity.
# Install additional packages
cd ./slime
pip install -e .

pip install -U "qwen-agent[rag,code_interpreter,mcp]" sandbox_fusion json5
```


### Step2: Download LLMs from Hugging Face
Please download the following models from Hugging Face:
```
Qwen3-30B-A3B
Qwen3-235B-A22B-Thinking
```
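
As a minimal sketch, the download could be scripted with `huggingface-cli` (shipped with the `huggingface_hub` package, which we assume is installed). The `Qwen/` organization prefix, the `MODEL_DIR` location, and the `download_model` helper are assumptions for illustration and not part of this repository; the exact repository ID of the Thinking model may carry a version suffix, so please check it on Hugging Face first. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical target directory; adjust to your storage layout.
MODEL_DIR="${MODEL_DIR:-./models}"

# Build the download command for one repository ID.
# With DRY_RUN=1 (the default) the command is printed, not executed.
download_model() {
  local repo_id="$1"
  local name="${repo_id##*/}"   # strip the "org/" prefix
  local cmd=(huggingface-cli download "$repo_id" --local-dir "$MODEL_DIR/$name")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Repository IDs are assumptions; verify them before a real download.
download_model "Qwen/Qwen3-30B-A3B"
download_model "Qwen/Qwen3-235B-A22B-Thinking"
```

Run it once as is to review the commands, then rerun with `DRY_RUN=0` to start the actual download.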

### Step3: Deploy Qwen3-235B-A22B with SGLang
```bash
# Set your model path before launching the server
cd ./launch_llm
bash launch_sglang.sh
```

### Step4: Convert models from Hugging Face to Megatron format
```bash
cd ./convert_hf_to_megatron
bash hf2mcore.sh
```

### Step5: SFT
```bash
# set parameters in qwen3-30B.sh before training
cd ./sft_scripts
bash qwen3-30B.sh
```


### Step6: RL

```bash
# set parameters in iterresearch.sh before training
cd ./rl_scripts
bash iterresearch.sh
```

## Case Study

Due to the 100MB file size limit for supplementary materials, we provide 50 randomly sampled full-length trajectories from our experiments on the BrowseComp dataset with $T_{\max} = 2048$ in `./traj_case/bc_2048.jsonl` instead of the complete set.
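
A quick way to sanity-check the file is sketched below, assuming the standard JSONL layout of one JSON record per line. The `count_records` helper and the `TRAJ_FILE` variable are hypothetical names introduced here; the record schema itself is not assumed, so the script only counts records and previews the start of the first one.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Path from this README; override via the environment if needed.
TRAJ_FILE="${TRAJ_FILE:-./traj_case/bc_2048.jsonl}"

# Count non-empty lines; in JSONL each one is a record.
# grep exits non-zero when the count is 0, hence the "|| true".
count_records() {
  grep -c . "$1" || true
}

if [ -f "$TRAJ_FILE" ]; then
  echo "records: $(count_records "$TRAJ_FILE")"
  # Preview the first 300 characters of the first record.
  head -n 1 "$TRAJ_FILE" | cut -c 1-300
else
  echo "not found: $TRAJ_FILE" >&2
fi
```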

## Others
* We provide sampled subsets of our SFT and RL data in `./data`.
* We will release our trained checkpoint after the review process.
* Special thanks to Slime, SGLang, and vLLM for their valuable work.


The steps above should be sufficient to reproduce our work.