# $A^\star$-PO

## Installation

```
conda create -n zero python=3.10
conda activate zero
# install torch (or skip this step and let vllm install the correct version for you)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# install vllm
pip3 install vllm==0.6.3 # or install 0.5.4, 0.4.2, or 0.3.1
pip3 install ray

# verl (run from the repository root)
pip install -e .

# flash attention 2
pip3 install flash-attn --no-build-isolation
# quality of life
pip install wandb IPython matplotlib

# for math verification
pip install antlr4-python3-runtime==4.9.3
pip install antlr4-tools
pip install "math-verify[antlr4_9_3]"
pip install ujson
pip install tyro
```
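
After installation, it can help to confirm that every package resolves inside the `zero` environment before launching a long job. The following is a minimal sketch, not part of this repository; the import names in `MODULES` are assumptions derived from the packages installed above (for example, `flash-attn` is imported as `flash_attn`).

```python
import importlib.util

# Import names assumed for the packages installed above.
MODULES = [
    "torch", "vllm", "ray", "flash_attn", "wandb", "IPython",
    "matplotlib", "antlr4", "math_verify", "ujson", "tyro",
]


def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it with the environment activated; any name it lists points to an install step that should be repeated.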

## Data Preparation

```
conda activate zero

# GSM8K
python ./examples/data_preprocess/gsm8k.py --local_dir {path_to_your_dataset}

# MATH
python ./examples/data_preprocess/math_dataset.py --local_dir {path_to_your_dataset}
```

## Stage 1 Generation from $\pi_{ref}$

To sample 8 responses per prompt from $\pi_{ref}$ using 4 GPUs:

```
python model_generate.py --model_name MODEL_NAME --dataset PATH_TO_TRAIN_PARQUET_FILE --remote_dir HUGGINGFACE_REPO --n 8 --world_size 4
```

To score the generated responses for correctness:

```
python model_evaluate.py --dataset HUGGINGFACE_REPO_FROM_GENERATE --remote_dir HUGGINGFACE_REPO --reward_function {gsm8k or math} --n 8
```
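
Once responses are scored, a quick per-prompt summary is a useful sanity check before training. The sketch below is purely illustrative: the output schema of `model_evaluate.py` is not documented here, so the dictionary layout, the prompt ids, and the `pass_rates` helper are all hypothetical.

```python
def pass_rates(correct_flags):
    """Map each prompt id to the fraction of its sampled responses marked correct."""
    return {pid: sum(flags) / len(flags) for pid, flags in correct_flags.items()}


# Hypothetical scores for two prompts with 8 samples each: 1 = correct, 0 = incorrect.
example = {
    "q1": [1, 1, 0, 1, 0, 1, 1, 1],
    "q2": [0, 0, 0, 0, 0, 0, 0, 0],
}

rates = pass_rates(example)
print(rates)
```

Prompts whose rate is 0 or 1 across all samples are worth inspecting, since they are either very hard or very easy for the reference model.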

## Stage 2 Training

**Activate Environment**

```
conda activate zero
```

**Start Job**

```
python3 -m verl.trainer.main_explore \
  algorithm.loss_type=without_g \
  algorithm.beta=1e-3 \
  algorithm.normalize_by_pi_ref=True \
  algorithm.kl_ctrl.kl_coef=0 \
  data.num_gen_to_use=8 \
  data.filter_incorrect=False \
  data.eta=2 \
  data.train_files=${path_to_huggingface_repo_from_evaluation} \
  data.val_files=${path_to_test_parquet_file} \
  data.max_prompt_length=256 \
  data.max_response_length=1024 \
  data.train_batch_size=256 \
  data.val_batch_size=500 \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-${model_size}B \
  actor_rollout_ref.actor.optim.lr=1e-6 \
  actor_rollout_ref.actor.ppo_mini_batch_size=128 \
  actor_rollout_ref.actor.ppo_micro_batch_size=4 \
  actor_rollout_ref.actor.entropy_coeff=0 \
  actor_rollout_ref.rollout.log_prob_micro_batch_size=4 \
  actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
  actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
  actor_rollout_ref.rollout.n=1 \
  actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
  trainer.logger=['wandb'] \
  +trainer.val_before_train=True \
  trainer.default_hdfs_dir=null \
  trainer.n_gpus_per_node=4 \
  trainer.nnodes=1 \
  trainer.save_freq=50 \
  trainer.test_freq=50 \
  trainer.project_name=a_star \
  trainer.experiment_name=a_star_${DATASET}_${model_size} \
  trainer.total_epochs=25
```
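
The command above reads four shell variables that must be set beforehand: `path_to_huggingface_repo_from_evaluation`, `path_to_test_parquet_file`, `model_size`, and `DATASET`. As a minimal sketch of how they expand, the Python helper below builds only the overrides that depend on them. The helper name and all values passed to it are placeholders for illustration, not settings prescribed by this repository.

```python
import shlex


def variable_overrides(model_size, dataset, train_files, val_files):
    """Return the Hydra overrides from the command above that depend on user-chosen values."""
    return [
        f"data.train_files={train_files}",
        f"data.val_files={val_files}",
        f"actor_rollout_ref.model.path=Qwen/Qwen2.5-{model_size}B",
        f"trainer.experiment_name=a_star_{dataset}_{model_size}",
    ]


# Placeholder values; substitute your own repo, parquet path, model size, and dataset name.
overrides = variable_overrides(
    model_size="1.5",
    dataset="gsm8k",
    train_files="your-hf-user/your-evaluated-repo",
    val_files="/path/to/test.parquet",
)

# The remaining fixed overrides from the command above would be appended unchanged.
command = ["python3", "-m", "verl.trainer.main_explore", *overrides]
print(shlex.join(command))
```

Printing the command rather than executing it makes it easy to check the expanded model path and experiment name before committing GPUs to a run.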
