# UTRL: Learning to Generate Unit Test via Adversarial Reinforcement Learning
---

## 🚀 Quick Start & Reproduction

### 📝 **Prerequisites**

#### **System Requirements**
- **GPU**: NVIDIA H100 80GB (recommended) + CUDA 12.4
- **OS**: Linux (Ubuntu 20.04+)

#### **Dependencies**
```bash
# Create conda environment
conda create -n utrl python==3.10
conda activate utrl

# Install dependencies for verl
cd verl
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
cd ..

# Install additional dependencies
pip install -r requirements.txt
```
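
Before moving on, it can help to confirm that the tools used throughout this README are visible in the activated environment. The snippet below is a minimal sketch, not part of the repository: `check_tools` is a hypothetical helper that only checks whether each command is on `PATH`, and it does not verify versions or CUDA compatibility.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with this repository): report whether
# each named command-line tool can be found on PATH.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok:      $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Return non-zero if at least one tool was not found.
    return "$missing"
}

# Tools referenced in this README; a missing tool is reported, not fatal.
check_tools conda python pip nvidia-smi huggingface-cli wandb || true
```

A `missing:` line for `nvidia-smi` usually means the NVIDIA driver is not visible from the current shell, which is worth resolving before starting a long training run.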

#### **Authentication Setup**
```bash
# Login to Hugging Face for model access
huggingface-cli login

# Login to Weights & Biases for experiment tracking
wandb login
```
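
On a headless training node the interactive prompts can be inconvenient. Both libraries also read credentials from environment variables (`HF_TOKEN` for the Hugging Face Hub, `WANDB_API_KEY` for Weights & Biases), so exporting them is an alternative to logging in. The sketch below only checks whether those variables are set; `missing_credentials` is a hypothetical helper introduced for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the name of every credential variable
# that is unset or empty in the current shell.
missing_credentials() {
    local name value
    for name in HF_TOKEN WANDB_API_KEY; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
        fi
    done
}

missing="$(missing_credentials)"
if [ -n "$missing" ]; then
    echo "Unset credential variables:" $missing
else
    echo "HF_TOKEN and WANDB_API_KEY are both set."
fi
```

If a variable is reported as unset, either export it or fall back to the interactive login commands above.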

### 🏃‍♂️ **Training model via UTRL**
We provide scripts for training Qwen3-4B via **UTRL**. Note that RL training takes a long time (e.g., 2 days to train Qwen3-4B via UTRL for 100 steps on 15K training samples from the TACO dataset).

#### **1. UTRL - Iteration 1**
```bash
# Create data for training unit test generator
bash scripts/prepare_ut_data_iter_1.sh

# Train UT generator LLM via UTRL
bash scripts/train_ut_model_iter_1.sh

# Create data for training code generator
bash scripts/prepare_code_data_iter_1.sh ${ut generator ckpt step (50)}

# Train Code generator LLM via UTRL
bash scripts/train_code_model_iter_1.sh
```
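
Because each step depends on the output of the previous one, the four commands above can be chained so that a failure stops the sequence. The following is a minimal sketch under stated assumptions: `UT_CKPT_STEP`, `DRY_RUN`, `run_step`, and `run_iteration_1` are hypothetical names introduced here, and the default step of 50 is taken from the placeholder above. By default the wrapper only prints the commands.

```bash
#!/usr/bin/env bash
# Sketch of a wrapper around the four Iteration 1 commands.
# UT_CKPT_STEP and DRY_RUN are hypothetical variables, not repository options.
UT_CKPT_STEP="${UT_CKPT_STEP:-50}"   # UT generator checkpoint step (50 above)
DRY_RUN="${DRY_RUN:-1}"              # 1: only print commands; 0: execute them

# Print a command and, unless in dry-run mode, execute it.
run_step() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

# Run the steps in order; '&&' stops at the first failing step.
run_iteration_1() {
    run_step bash scripts/prepare_ut_data_iter_1.sh &&
    run_step bash scripts/train_ut_model_iter_1.sh &&
    run_step bash scripts/prepare_code_data_iter_1.sh "$UT_CKPT_STEP" &&
    run_step bash scripts/train_code_model_iter_1.sh
}

run_iteration_1
```

Saved as a script, this prints the plan on a plain invocation and executes it when run with `DRY_RUN=0` from the repository root.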

#### **2. UTRL - Iteration 2**
```bash
# Create data for training unit test generator
bash scripts/prepare_ut_data_iter_2.sh ${code generator ckpt step (370)}

# Train UT generator LLM via UTRL
bash scripts/train_ut_model_iter_2.sh ${ut generator ckpt step (50)}
```
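
The step arguments above (370 for the code generator, 50 for the UT generator) must correspond to checkpoints that were actually saved. The helper below is a sketch for listing the available steps. It ASSUMES the checkpoints follow verl's default layout of one `global_step_<N>` directory per saved step; where that directory lives depends on the training scripts, so the path is left to the reader. `list_ckpt_steps` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the step numbers of saved checkpoints in a directory.
# ASSUMPTION: checkpoints are stored as sub-directories named global_step_<N>.
list_ckpt_steps() {
    local dir="$1" path
    for path in "$dir"/global_step_*; do
        # Skip the unmatched glob pattern and anything that is not a directory.
        [ -d "$path" ] || continue
        # Strip everything up to and including "global_step_".
        echo "${path##*global_step_}"
    done | sort -n
}
```

Calling `list_ckpt_steps` with the checkpoint directory of a finished run prints one step number per line in ascending order, and prints nothing if no checkpoint directories are found.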

### 🏃‍♂️ **Training model via SFT**
```bash
# SFT with D_UT
python sft_train_testgen_dt.py \
    --data_path ${path for the dataset D_UT} \
    --model_name Qwen/Qwen3-4B

# SFT with D_reason_UT
python sft_train_testgen_dt.py \
    --data_path ${path for the dataset D_reason_UT} \
    --model_name Qwen/Qwen3-4B
```




### 📊 **Evaluation**
```bash
# step 0. Sample multiple code solutions (32 code solutions per task)
python -m inference.generate_solution \
    --solution_generation_model Qwen/Qwen3-4B \
    --target_path qwen3_4b \
    --best_of_n \
    --n_samples 32 \
    --tensor_parallel_size ${n_gpus to use}

python -m inference.generate_solution \
    --solution_generation_model Qwen/Qwen3-8B \
    --target_path qwen3_8b \
    --best_of_n \
    --n_samples 32 \
    --tensor_parallel_size ${n_gpus to use}

python -m inference.generate_solution \
    --solution_generation_model Qwen/Qwen3-14B \
    --target_path qwen3_14b \
    --best_of_n \
    --n_samples 32 \
    --tensor_parallel_size ${n_gpus to use}

python -m inference.generate_solution \
    --solution_generation_model gpt-4o-2024-08-06 \
    --target_path gpt_4o \
    --best_of_n \
    --n_samples 32 \
    --tensor_parallel_size ${n_gpus to use}


# step 1. Generate unit tests using the trained checkpoint
python -m inference.generate_unit_test \
    --test_generation_model ${model checkpoint path or HF model name} \
    --target_path ${signature of the model checkpoint, e.g., qwen3_4b_utrl} \
    --split test \
    --tensor_parallel_size ${n_gpus to use}

# step 2. Evaluate best-of-N improvement
python -m evaluation.evaluate_bon_solution_stdio \
    --test_generation_model ${signature of the model checkpoint} \
    --solution_generation_model ${signature of the model used for code sampling, e.g., qwen3_4b, qwen3_8b, qwen3_14b, gpt_4o} \
    --best_of_n \
    --n_samples 32 \
    --split test
```
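
The four step-0 invocations differ only in the model name and the target signature, so they can be generated from a single list. The sketch below prints the commands rather than running them. `N_GPUS`, its default of 1, the `model:signature` pair format, and `print_sampling_commands` are hypothetical conveniences introduced here; the flags themselves are copied from the commands above.

```bash
#!/usr/bin/env bash
# Sketch: print the step-0 sampling command for each solution model listed above.
# N_GPUS is a hypothetical variable; its default of 1 is an assumption.
N_GPUS="${N_GPUS:-1}"

print_sampling_commands() {
    local pair model signature
    for pair in \
        "Qwen/Qwen3-4B:qwen3_4b" \
        "Qwen/Qwen3-8B:qwen3_8b" \
        "Qwen/Qwen3-14B:qwen3_14b" \
        "gpt-4o-2024-08-06:gpt_4o"; do
        model="${pair%%:*}"       # text before the first ':'
        signature="${pair##*:}"   # text after the last ':'
        echo "python -m inference.generate_solution --solution_generation_model $model --target_path $signature --best_of_n --n_samples 32 --tensor_parallel_size $N_GPUS"
    done
}

print_sampling_commands
```

Reviewing the printed commands first and then piping them to `bash` keeps the model list in one place. The same pattern could be applied to step 2 by looping over the solution signatures.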