### KL Training Script Guide

This document describes `supp_materia/kl_exp.sh`, an example script for PPO training with KL regularization using OpenRLHF. It uses `Qwen2.5-Math-7B` as the default base model, trains on a math dataset, and obtains reward signals from an external evaluation script passed via `--remote_rm_url`.

---

### Prerequisites
- A multi-GPU environment is available (the script assumes 8 GPUs per node by default).
- Conda environment: `openrlhf` (the script runs `conda activate openrlhf`).
- Python dependencies (example; adjust versions as needed):
  ```bash
  pip install openrlhf vllm deepspeed ray transformers datasets accelerate sentencepiece
  ```
- A pretrained model accessible locally or from a model hub, e.g., `Qwen2.5-Math-7B`.
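- Reward model/scoring service entry: the script uses `--remote_rm_url /script/mathverify.py`. Ensure the path exists and is readable by the training processes (a local `.py` reward file is typically loaded by OpenRLHF as a module rather than run as a standalone program, so check the interface your installed version expects), or replace it with your own script or service URL.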
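
The prerequisites above can be checked before launching. The following is a minimal sketch, not part of `kl_exp.sh`: the helper names `require_file` and `count_gpus` are hypothetical, and the GPU count relies on `nvidia-smi` being installed.

```bash
# Pre-flight helpers (hypothetical names, not defined by kl_exp.sh or OpenRLHF).

# Return 0 if the given path is a readable regular file, 1 otherwise.
require_file() {
  if [ -f "$1" ] && [ -r "$1" ]; then
    return 0
  fi
  echo "missing or unreadable: $1" >&2
  return 1
}

# Print the number of GPUs listed by nvidia-smi; print 0 if it is not installed.
count_gpus() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
  else
    echo 0
  fi
}

# Example usage (commented out so that sourcing this file has no side effects):
# require_file /script/mathverify.py
# require_file Openr1-Math-7k-4096.jsonl
# [ "$(count_gpus)" -ge 8 ] || echo "fewer than 8 GPUs visible" >&2
```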

---

### Data Preparation
- Default dataset file: `Openr1-Math-7k-4096.jsonl`
  - JSONL format. Each line should include keys:
    - `question` (maps to `--input_key`)
    - `label` (maps to `--label_key`)
- Paths are relative to the current working directory. If your data is elsewhere, set an absolute path in `--prompt_data`.
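
As a quick sanity check on the dataset format described above, the sketch below counts lines that do not appear to contain a given key. It is a crude textual match rather than a JSON parser (blank lines also count as missing), and the function name `count_missing_key` is hypothetical.

```bash
# Count lines in a JSONL file that lack a "key": pattern.
# Textual match only; it does not validate that each line is well-formed JSON.
count_missing_key() {
  local file="$1" key="$2"
  # -v selects non-matching lines, -c counts them; grep exits 1 when the
  # count is 0, so "|| true" keeps the function from failing in that case.
  grep -v -c "\"${key}\"[[:space:]]*:" "$file" || true
}

# Example usage (commented out):
# count_missing_key Openr1-Math-7k-4096.jsonl question
# count_missing_key Openr1-Math-7k-4096.jsonl label
```

A result of `0` for both keys suggests every line carries the fields that `--input_key` and `--label_key` expect.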

---

### Key Environment Variables and Script Arguments
- Required environment variable:
  - `KL_COEF`: Initial KL coefficient (passed to `--init_kl_coef`).
- Variables you can modify at the top of the script:
  - `MODEL_SIZE`: default `7B`.
  - `MODEL_PATH`: default `Qwen2.5-Math-${MODEL_SIZE}`.
  - `KL_TYPE`: default `k1`, used by `--kl_estimator` (can change to `k2`, `k3`).
- Output (defaults point to the root `/`; change to a writable path as needed):
  - Logs: `/${KL_COEF}-${MODEL_SIZE}-${KL_TYPE}-${timestamp}.log`
  - Checkpoints: `--save_path` and `--ckpt_path` both point to `/${KL_COEF}-${MODEL_SIZE}-${KL_TYPE}-${timestamp}`
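
The naming scheme above can be reproduced with a small helper. This is a sketch under assumptions, not the script's actual code: the function name `run_name` is hypothetical, and the timestamp format `%Y%m%d-%H%M` is only inferred from the example log name given later in this guide.

```bash
# Build "<KL_COEF>-<MODEL_SIZE>-<KL_TYPE>-<timestamp>" from the environment.
# An optional first argument overrides the timestamp (useful for testing).
run_name() {
  if [ -z "${KL_COEF:-}" ]; then
    echo "KL_COEF must be set, e.g. export KL_COEF=0.02" >&2
    return 1
  fi
  local size="${MODEL_SIZE:-7B}"
  local type="${KL_TYPE:-k1}"
  local ts="${1:-$(date +%Y%m%d-%H%M)}"   # assumed timestamp format
  printf '%s-%s-%s-%s\n' "$KL_COEF" "$size" "$type" "$ts"
}

# Example usage (commented out; OUT_ROOT is a hypothetical writable directory):
# OUT_ROOT=/workspace/outputs
# logfile="${OUT_ROOT}/$(run_name).log"
```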

---

### Example Run
```bash
# 1) Prepare environment
conda activate openrlhf

# 2) Set KL coefficient (example)
export KL_COEF=0.5

# 3) (Optional) Edit the script to change log and save paths from root to your workspace, e.g., /workspace/outputs
#    Replace the following three in kl_exp.sh:
#    logfile="/...", --save_path /..., --ckpt_path /...

# 4) Launch
bash supp_materia/kl_exp.sh
```

After launch, logs are printed to the console and written via `tee` to the specified log file.
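
For reference, the `tee` pattern can be sketched as follows. A plain pipe into `tee` reports the exit status of `tee`, not of the training command, unless `pipefail` is set. The hypothetical wrapper `run_logged` below keeps the command's own status. It relies on the bash-specific `PIPESTATUS` array and is not taken from `kl_exp.sh`.

```bash
# Run a command, mirror stdout and stderr to a log file, and return the
# command's own exit status (bash only: uses PIPESTATUS).
run_logged() {
  local logfile="$1"
  shift
  "$@" 2>&1 | tee "$logfile"
  return "${PIPESTATUS[0]}"
}

# Example usage (commented out):
# run_logged /workspace/outputs/run.log bash supp_materia/kl_exp.sh
```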

---

### Main Training Arguments (Excerpt)
- `--ref_num_nodes 1`, `--ref_num_gpus_per_node 8`: reference model process topology.
- `--actor_num_nodes 1`, `--actor_num_gpus_per_node 8`: actor-side topology.
- `--vllm_num_engines 8`, `--vllm_tensor_parallel_size 1`: number of vLLM inference engines and the tensor-parallel size of each engine.
- `--vllm_gpu_memory_utilization 0.6`: target GPU memory utilization for vLLM.
- `--init_kl_coef ${KL_COEF}`, `--use_kl_loss`, `--kl_estimator ${KL_TYPE}`: KL regularization settings.
- `--advantage_estimator group_norm`: advantage estimator; `group_norm` normalizes rewards within the group of samples generated for the same prompt (GRPO-style).
- `--pretrain ${MODEL_PATH}`: base model path/identifier.
- `--remote_rm_url /script/mathverify.py`: external reward service entry (local or remote URL).
- `--micro_train_batch_size 32`, `--train_batch_size 256`: micro-batch and global batch sizes.
- `--micro_rollout_batch_size 32`, `--rollout_batch_size 32`, `--n_samples_per_prompt 8`: rollout/generation batch settings.
- `--prompt_max_len 1024`, `--generate_max_len 3072`: max input and generation lengths.
- `--max_epochs 1`, `--max_samples 999999999`: epochs and sample upper bound.
- `--zero_stage 3`, `--bf16`, `--gradient_checkpointing`: DeepSpeed ZeRO stage 3, bfloat16 mixed precision, and gradient (activation) checkpointing to reduce memory use.
- `--apply_chat_template`, `--packing_samples`: chat template and sample packing.
- `--prompt_data Openr1-Math-7k-4096.jsonl`, `--input_key question`, `--label_key label`: data and field mapping.
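- `--vllm_sync_backend nccl`, `--enforce_eager`, `--vllm_enable_sleep`, `--deepspeed_enable_sleep`: NCCL-based weight synchronization to the vLLM engines, eager-mode vLLM execution (no CUDA graphs), and sleep modes that release GPU memory while vLLM or DeepSpeed is idle.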

Tune these configurations to match your hardware and task requirements.
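
When tuning, it helps to check how the batch and length settings relate. The arithmetic below uses the values from the excerpt above and assumes that `--rollout_batch_size` counts prompts, that `--train_batch_size` is a global sample count, and that `--micro_train_batch_size` is per GPU. Verify these semantics against your OpenRLHF version, especially with `--packing_samples` enabled.

```bash
# Values copied from the argument excerpt.
rollout_batch_size=32        # prompts per rollout step (assumed meaning)
n_samples_per_prompt=8
train_batch_size=256         # global samples per optimizer step (assumed meaning)
micro_train_batch_size=32    # samples per GPU per forward/backward (assumed meaning)
actor_gpus=8                 # actor_num_nodes * actor_num_gpus_per_node
prompt_max_len=1024
generate_max_len=3072

samples_per_rollout=$(( rollout_batch_size * n_samples_per_prompt ))
updates_per_rollout=$(( samples_per_rollout / train_batch_size ))
grad_accum_steps=$(( train_batch_size / (micro_train_batch_size * actor_gpus) ))
max_seq_len=$(( prompt_max_len + generate_max_len ))

echo "samples per rollout step:  ${samples_per_rollout}"   # 256
echo "optimizer steps per rollout: ${updates_per_rollout}" # 1
echo "gradient accumulation steps: ${grad_accum_steps}"    # 1
echo "max total sequence length:   ${max_seq_len}"         # 4096
```

Under these assumptions, each rollout step produces exactly one global training batch, and the maximum sequence length of 4096 tokens matches the `4096` suffix in the default dataset file name.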

---

### Outputs and Artifacts
- Logs: `${logfile}` (e.g., `/0.02-7B-k1-20250101-1200.log`).
- Checkpoints and weights: directories pointed to by `--save_path` and `--ckpt_path` (timestamped).
- Optional: `--save_hf_ckpt` saves in a HuggingFace-compatible format.

---

### FAQ
- Permission denied when writing outputs:
  - By default, logs and weights are written under `/`. Set `logfile`, `--save_path`, and `--ckpt_path` to a directory you can write to (e.g., `/workspace/outputs/...`).
- Dataset file not found:
  - Verify the `--prompt_data` path; using an absolute path is recommended.
- GPU count/topology mismatch:
  - Adjust `--actor_num_gpus_per_node`, `--ref_num_gpus_per_node`, `--vllm_num_engines`, etc., according to your resources.
- Reward service unavailable:
  - Check that `--remote_rm_url` points to an existing and usable script/service; replace with your own RM service address or local script if necessary.
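
The permission issue in particular can be caught before a long run starts. A minimal sketch, assuming a hypothetical helper named `ensure_writable_dir`:

```bash
# Create the directory if needed and confirm the current user can write to it.
# Returns 0 on success, non-zero otherwise.
ensure_writable_dir() {
  mkdir -p "$1" 2>/dev/null && [ -w "$1" ]
}

# Example usage (commented out; the path is illustrative):
# ensure_writable_dir /workspace/outputs || echo "cannot write to /workspace/outputs" >&2
```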

---

### Customization and Extension
- Change the base model: edit `MODEL_PATH` (e.g., `Qwen2.5-Math-32B`) and ensure sufficient hardware.
- Change the KL estimator: edit `KL_TYPE` (e.g., `k1`, `k2`, `k3`). Before using other values such as `forward_kl` or `reverse_kl`, confirm that the `--kl_estimator` option in your installed OpenRLHF version accepts them.
- Adjust sequence lengths and batch sizes: tune `--prompt_max_len`, `--generate_max_len`, `--train_batch_size`, etc., to fit memory.
- Replace/enhance the reward model: point `--remote_rm_url` to your RM service or script to implement custom evaluation and reward signals.
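
For orientation when choosing a KL estimator, the `k1`, `k2`, and `k3` names conventionally refer to the following per-token estimators of $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ on tokens sampled from the actor $\pi_\theta$. This is the common definition, stated here as an assumption. Check the implementation in your OpenRLHF version for the exact form it uses.

$$
\delta_t = \log \pi_\theta(a_t \mid s_t) - \log \pi_{\mathrm{ref}}(a_t \mid s_t)
$$

$$
k_1 = \delta_t, \qquad
k_2 = \tfrac{1}{2}\,\delta_t^{2}, \qquad
k_3 = e^{-\delta_t} - 1 + \delta_t
$$

Under this definition, $k_1$ is unbiased but can be negative for individual tokens and has high variance. $k_2$ is always non-negative and has lower variance but is biased. $k_3$ is both unbiased and non-negative.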

---

### Reproducibility and Log Analysis
- Because logs are mirrored to a file via `tee`, you can inspect the most recent key statistics with:
  ```bash
  grep -E "(loss|kl|reward|lr)" /path/to/your.log | tail -n 50
  ```
- Consider using visualization tools (e.g., TensorBoard or Weights & Biases) to track training curves.

---

### Disclaimer
This script is provided for example and reference only. Adapt it to your environment and objectives, and follow all applicable licenses for data and models.


