# Codebase used for running experiments

We modified [trl](https://github.com/huggingface/trl) to implement a new `grpo_masked` loss type, a configurable frequency for updating the model on the vLLM server, and other features detailed in our submission.
In this folder we provide:
- our modified trl
- our training script for GSM8K, including the system prompt and reward functions
- a shell script for running the experiments
- an example YAML configuration

Below is the configuration for the experiments (`gsm8k_exp_grpo.yaml`):

```yaml
common:
  loss_type: grpo
  beta: null
  num_iterations: null
  vllm_update_model_eve: null
  update_ref_model_eve: null
  output_dir: null
  run_name: null
  model_name_or_path: Qwen/Qwen2.5-0.5B-Instruct
  learning_rate: 5e-6
  do_eval: true
  adam_beta1: 0.9
  adam_beta2: 0.99
  weight_decay: 0.1
  warmup_ratio: 0.1
  lr_scheduler_type: cosine
  logging_steps: 1
  eval_steps: 10
  eval_strategy: steps
  bf16: true
  per_device_train_batch_size: 16
  per_device_eval_batch_size: 16
  gradient_accumulation_steps: 4
  num_generations: 16
  max_prompt_length: 256
  max_completion_length: 200
  num_train_epochs: 1
  save_steps: 400
  max_grad_norm: 0.1
  log_on_each_node: false
  use_vllm: true
  vllm_gpu_memory_utilization: 0.3
  vllm_device: "cuda:0"
  report_to: tensorboard
# experiments
gsm8k-exp1:
  run: true
  beta: 0.1
  num_iterations: 1
  vllm_update_model_eve: 1
  update_ref_model_eve: -1
  output_dir: outputs/g2.5-0.5B-Instruct-grpo-v1-i1
  run_name: g2.5-0.5B-Instruct-grpo-v1-i1
gsm8k-exp2:
  run: true
  beta: 0.1
  num_iterations: 10
  vllm_update_model_eve: 10
  update_ref_model_eve: -1
  output_dir: outputs/g2.5-0.5B-Instruct-grpo-v10-i1
  run_name: g2.5-0.5B-Instruct-grpo-v10-i1
gsm8k-exp3:
  run: true
  loss_type: grpo_masked
  beta: 0.1
  num_iterations: 1
  vllm_update_model_eve: 10
  update_ref_model_eve: -1
  output_dir: outputs/g2.5-0.5B-Instruct-grpo-v1-i1-gm
  run_name: g2.5-0.5B-Instruct-grpo-v1-i1-gm
```
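The layout above suggests that each experiment section overrides the defaults in `common`. Below is a minimal Python sketch of that merge, assuming a shallow override and that `run: true` marks an experiment as enabled. The actual launcher, `launch-gsm8k-jobs.sh`, reads the file with `yq` and may differ in detail; `resolve_experiments` is a hypothetical helper, and the dictionary is a trimmed copy of the YAML written by hand.

```python
def resolve_experiments(config):
    """Return {experiment_name: merged_settings} for enabled experiments."""
    common = config.get("common", {})
    resolved = {}
    for name, overrides in config.items():
        if name == "common":
            continue
        if not overrides.get("run", False):
            continue  # assumed meaning: skip experiments not marked run: true
        merged = {**common, **overrides}  # experiment values win over common
        merged.pop("run", None)  # `run` is treated here as launcher-only metadata
        resolved[name] = merged
    return resolved


# Trimmed copy of gsm8k_exp_grpo.yaml, written as a Python dict.
config = {
    "common": {"loss_type": "grpo", "beta": None, "num_iterations": None,
               "vllm_update_model_eve": None, "learning_rate": 5e-6},
    "gsm8k-exp1": {"run": True, "beta": 0.1, "num_iterations": 1,
                   "vllm_update_model_eve": 1},
    "gsm8k-exp3": {"run": True, "loss_type": "grpo_masked", "beta": 0.1,
                   "num_iterations": 1, "vllm_update_model_eve": 10},
}

experiments = resolve_experiments(config)
for name, settings in experiments.items():
    print(name, settings["loss_type"], settings["vllm_update_model_eve"])
```

Under these assumptions, `gsm8k-exp1` inherits `loss_type: grpo` from `common`, while `gsm8k-exp3` replaces it with `grpo_masked`.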

## Installation

To run the code, create a Python virtual environment with `uv`. To install `uv`, follow the [installation guide](https://docs.astral.sh/uv/getting-started/installation/).

```shell
uv venv trl_op --python 3.11 && source trl_op/bin/activate && uv pip install --upgrade pip
uv pip install vllm==0.8.4
uv pip install setuptools && uv pip install flash-attn --no-build-isolation
uv pip install deepspeed tensorboard
```
Next, install our modified trl:
```shell
cd trl
uv pip install -e ".[dev]"
```

To run the shell script, you also need `yq`, which can be installed by following [these instructions](https://github.com/mikefarah/yq?tab=readme-ov-file#install).

## Run experiments

Assuming a server with 8 GPUs, we use one GPU for sampling and the other 7 GPUs for training.
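As a rough sanity check on this split, the batch settings in `gsm8k_exp_grpo.yaml` can be combined with the 7 training GPUs. The sketch below assumes that `per_device_train_batch_size` counts completions and that one training process runs per GPU; how our modified trl groups completions may differ, so treat the numbers as an estimate rather than a specification.

```python
# Values copied from the `common` section of gsm8k_exp_grpo.yaml.
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
num_generations = 16
num_training_gpus = 7  # 8 GPUs minus the one reserved for the vLLM server

# Completions processed across all training GPUs in one forward/backward pass.
completions_per_forward = per_device_train_batch_size * num_training_gpus
# Completions contributing to one optimizer update.
completions_per_update = completions_per_forward * gradient_accumulation_steps

# GRPO compares completions of the same prompt, so the global batch
# should divide evenly into groups of `num_generations`.
assert completions_per_forward % num_generations == 0

prompts_per_forward = completions_per_forward // num_generations
prompts_per_update = completions_per_update // num_generations
print(completions_per_forward, prompts_per_forward,
      completions_per_update, prompts_per_update)
```

With these values, the global batch holds 112 completions (7 prompts), and each optimizer update sees 448 completions (28 prompts).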

First start the vLLM server:
```shell
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-0.5B-Instruct
```
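Loading the model can take a while, so it may help to wait until the server responds before launching training. The following is a minimal sketch using only the Python standard library; `wait_for_server` is a hypothetical helper, and the URL you pass (host, port, and health path) is an assumption that must match how your server is actually exposed.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout=60.0, interval=1.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; retry until the deadline
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_for_server("http://localhost:8000/health/", timeout=300)` would block for up to five minutes, assuming the server listens on port 8000 and exposes a `/health/` route; check your server's startup log for the real address.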

Then start the training using `launch-gsm8k-jobs.sh`:

```shell
PYTHONPATH=$(realpath trl) ./launch-gsm8k-jobs.sh gsm8k_exp_grpo.yaml
```
This will start the three experiments defined in `gsm8k_exp_grpo.yaml` sequentially.
