# Training with ModelScope-Swift

## Data Preparation
> **Note:**
> At present, the data-collection scripts must be executed from within the `lltm/training` directory.
> The copies shown here are for reference only.
> To actually run them, move the files back into the `lltm/training` directory.

You currently have the following files in `13_swift/data`:
- `get_data.py`
- `cr.py`
- `transforms.py`

Please move the files as follows:
- Move `get_data.py` to `lltm/training`
- Move `cr.py` and `transforms.py` to `lltm/training/ltm/data`
- After moving them, change your working directory to `lltm/training`

Once the files are in place, run the command below from inside `lltm/training`.
This generates `.jsonl` files in the directory given by `--datadir`, formatted according to the MS-Swift dataset specification.

```bash
uv run python get_data.py --base [path_to_config_file] -n [int] --datadir [path_to_output_directory]
```

Example:

```bash
uv run python get_data.py --base configs/cr/additiontasksmall-ltm-0.5b.yaml -n 3000 --datadir /home/[user name]/lltm/data
```
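
Before training, it can be worth sanity-checking the generated files. The following is a minimal sketch, not part of the repository: `check_jsonl` is a hypothetical helper, and the default `required_keys=("messages",)` is an assumption based on MS-Swift's standard chat format. Confirm the actual field names against the output of `get_data.py`.

```python
import json
from pathlib import Path


def check_jsonl(path, required_keys=("messages",)):
    """Return (n_records, problems) for a JSON Lines file.

    `required_keys` is an assumption: adjust it to the fields that
    get_data.py actually writes.
    """
    problems = []
    n_records = 0
    with Path(path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            n_records += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {lineno}: record is not a JSON object")
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append(f"line {lineno}: missing keys {missing}")
    return n_records, problems
```

Calling `check_jsonl` on each file in the output directory and printing the returned problems gives a quick signal before a long run is started.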

## RL Training Method

### Dataset

The datasets are located at the paths listed below.
**Important:** CruxEval originally contains 800 items and LiveCodeBench 479 items.
However, during evaluation Swift fetches data in chunks of
`per_device_eval_batch_size / num_generations * NPROC_PER_NODE` items
and drops any remainder that does not fill a whole chunk.
To avoid this truncation, dummy records have been added: CruxEval now has 804 items and LiveCodeBench 480 items.
When reporting evaluation results, account for these dummy records (4 for CruxEval, 1 for LiveCodeBench).

* Pytracify (with corrupted Qwen2.5/3 entries removed): `/path/to/home/lltm-h200/data_stable/Pytracify_deleted.jsonl`
* CruxEval: `/path/to/home/lltm-h200/data_stable/CruxEval.jsonl`
* LiveCodeBench: `/path/to/home/lltm-h200/data_stable/LCB_all.jsonl`
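
The padding arithmetic can be reproduced with a short sketch. The function names are hypothetical, the settings are taken from the evaluation command in this document (`per_device_eval_batch_size=4`, `num_generations=2`, `NPROC_PER_NODE=6`), and `adjusted_mean_reward` assumes that every dummy record receives a reward of 0, which should be verified against the actual dummy records.

```python
def eval_chunk_size(per_device_eval_batch_size, num_generations, nproc_per_node):
    """Chunk size from the note above: batch / generations * processes."""
    # Multiply first so the result stays an exact integer.
    total = per_device_eval_batch_size * nproc_per_node
    if total % num_generations != 0:
        raise ValueError("chunk size is not an integer for these settings")
    return total // num_generations


def dummies_needed(n_items, chunk):
    """Smallest number of records to add so n_items is a multiple of chunk."""
    return -n_items % chunk


def adjusted_mean_reward(mean_reward, n_padded, n_original):
    """Rescale a mean over the padded set to the original items only.

    Assumption: every dummy record receives reward 0.0.
    """
    return mean_reward * n_padded / n_original


chunk = eval_chunk_size(4, 2, 6)  # values from the evaluation command
print(chunk)  # 12
print(dummies_needed(800, chunk), dummies_needed(479, chunk))  # 4 1
```

With these settings the chunk size is 12, which matches the 4 dummy records added to CruxEval and the 1 added to LiveCodeBench.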

### Training

First, launch a vLLM server with `swift rollout`.
Set `--use_hf true` to load the model from Hugging Face.
Specify the model name with `--model` and add the matching `--model_type` where required (see the [list of supported models](https://www.aidoczh.com/swift/en/Instruction/%E6%94%AF%E6%8C%81%E7%9A%84%E6%A8%A1%E5%9E%8B%E5%92%8C%E6%95%B0%E6%8D%AE%E9%9B%86.html)).

```bash
#!/bin/bash

CUDA_VISIBLE_DEVICES=6,7 \
swift rollout \
    --model Qwen/Qwen2.5-7B-Instruct \
    --model_type 'qwen2_5' \
    --use_hf true \
    --data_parallel_size 1 \
    --tensor_parallel_size 2 \
    > [path_to_log] 2>&1 &
```
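
Because the server is started in the background, the training job can be launched before the server is listening. The sketch below is one way to wait for it. `wait_for_port` is a hypothetical helper, and it only checks that the TCP port accepts connections, not that the model has finished loading. The host and port are assumed to match `--vllm_server_host` and `--vllm_server_port` in the training command.

```python
import socket
import time


def wait_for_port(host, port, timeout=600.0, interval=2.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("127.0.0.1", 8000)` could be called from a launcher script before `swift rlhf` is started. The server log remains the authoritative place to confirm that loading succeeded.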

Next, define the reward function.
Place the following code in `reward_funcs/exactmatch.py`:

```python
import re

from swift.plugin import ORM, orms


class ExactMatch(ORM):
    """Reward 1.0 when the text inside <answer>...</answer> equals the solution exactly."""

    def __call__(self, completions, solution, **kwargs):
        rewards = []
        for completion, sol in zip(completions, solution):
            try:
                # Check that the completion contains an <answer>...</answer> block
                match = re.search(r"<answer>(.*?)</answer>", completion)
                if match is None:
                    rewards.append(0.0)
                    continue
                # Extract the answer and compare it with the reference solution
                completion_answer = match.group(1)
                rewards.append(1.0 if completion_answer == sol else 0.0)
            except Exception:
                # If evaluation fails, reward is 0
                rewards.append(0.0)
        return rewards


# Register the class under the name passed to --reward_funcs
orms['exactmatch'] = ExactMatch
```
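
To see how strict this reward is, the matching logic can be exercised without installing Swift. The sketch below mirrors the logic of `ExactMatch.__call__` in a standalone function. `exact_match_rewards` is a hypothetical name used only for illustration, and unlike the plugin it does not catch exceptions from non-string inputs.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>")


def exact_match_rewards(completions, solutions):
    """Standalone copy of the exact-match logic, for local experimentation."""
    rewards = []
    for completion, sol in zip(completions, solutions):
        match = ANSWER_RE.search(completion)
        # Reward only when the tags are present and the content is identical.
        rewards.append(1.0 if match and match.group(1) == sol else 0.0)
    return rewards


completions = [
    "reasoning... <answer>42</answer>",  # tags present, exact match
    "<answer> 42 </answer>",             # surrounding whitespace differs
    "The answer is 42",                  # no <answer> tags
    "<answer>4\n2</answer>",             # '.' does not match a newline
]
print(exact_match_rewards(completions, ["42"] * 4))  # [1.0, 0.0, 0.0, 0.0]
```

Only the first completion is rewarded: extra whitespace, missing tags, and answers that span several lines all score 0, so the prompts and reference solutions need to follow the same format.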

Then start training:

```bash
#!/bin/bash

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
NPROC_PER_NODE=6 \
swift rlhf \
    --rlhf_type grpo \
    --train_type full \
    --torch_dtype bfloat16 \
    --model Qwen/Qwen2.5-7B-Instruct \
    --model_type 'qwen2_5' \
    --use_hf true \
    --dataset '/path/to/home/lltm-h200/data_stable/Pytracify_deleted.jsonl' \
    --val_dataset '/path/to/home/lltm-h200/data_stable/CruxEval.jsonl' \
    --external_plugins /path/to/home/lltm-cp-h200/reward_funcs/exactmatch.py \
    --reward_funcs exactmatch \
    --use_vllm true \
    --vllm_mode server \
    --vllm_server_host 127.0.0.1 \
    --vllm_server_port 8000 \
    --learning_rate 1e-6 \
    --warmup_ratio 0.01 \
    --max_completion_length 8192 \
    --num_train_epochs 1 \
    --max_steps 10000 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --num_generations 8 \
    --beta 0.001 \
    --num_iterations 1 \
    --gradient_accumulation_steps 2 \
    --eval_steps 100 \
    --eval_limit 96 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --output_dir [output_path] \
    --dataloader_num_workers 4 \
    --temperature 1.0 \
    --top_p 0.9 \
    --top_k 50 \
    --deepspeed zero3
```

If training stops midway, resume from a checkpoint by adding:

```bash
    --resume_from_checkpoint /path/to/home/lltm-cp-h200/output/grpo_pytracify_qwen_2_5/v3-20250719-181956/checkpoint-1100
```

You generally do **not** need to point the vLLM server at the checkpoint, because the weights are synced to it automatically; launching the server from the checkpoint is nevertheless safe.
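
When a run directory contains several checkpoints, the most recent one can be located with a short sketch like the following. `latest_checkpoint` is a hypothetical helper that assumes checkpoints are subdirectories named `checkpoint-<step>`, as in the path above.

```python
import re
from pathlib import Path


def latest_checkpoint(run_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for child in Path(run_dir).iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        if match and child.is_dir():
            step = int(match.group(1))
            # Compare numerically so checkpoint-1100 beats checkpoint-900.
            if step > best_step:
                best_step, best_path = step, child
    return best_path
```

The returned path can then be passed to `--resume_from_checkpoint`. Steps are compared as integers because a plain string sort would rank `checkpoint-900` above `checkpoint-1100`.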

### Evaluation

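Evaluation reuses the training command. Set `gradient_accumulation_steps=2` and `eval_steps=1` (together with `max_steps=1`, as in the command below) so that the run performs a single training step and then evaluates.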

First, start the server:

```bash
#!/bin/bash

CUDA_VISIBLE_DEVICES=6,7 \
swift rollout \
    --model "/path/to/home/lltm-cp-h200/output/grpo_pytracify_qwen_2_5/v5-20250730-161822/checkpoint-2000" \
    --model_type 'qwen2_5' \
    --use_hf true \
    --data_parallel_size 1 \
    --tensor_parallel_size 2 \
    > /path/to/home/lltm-cp-h200/log/vllm.log 2>&1 &
```

Then run what is essentially the training command, but for evaluation:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
NPROC_PER_NODE=6 \
swift rlhf \
    --rlhf_type grpo \
    --train_type full \
    --torch_dtype bfloat16 \
    --model "/path/to/home/lltm-cp-h200/output/grpo_pytracify_qwen_2_5/v5-20250730-161822/checkpoint-2000" \
    --model_type 'qwen2_5' \
    --use_hf true \
    --dataset '/path/to/home/lltm-h200/data_stable/Pytracify_deleted.jsonl' \
    --val_dataset '/path/to/home/lltm-h200/data_stable/LCB_all.jsonl' \
    --external_plugins /path/to/home/lltm-cp-h200/reward_funcs/exactmatch.py \
    --reward_funcs exactmatch \
    --use_vllm true \
    --vllm_mode server \
    --vllm_server_host 127.0.0.1 \
    --vllm_server_port 8000 \
    --learning_rate 1e-6 \
    --warmup_ratio 0.01 \
    --max_completion_length 8192 \
    --num_train_epochs 1 \
    --max_steps 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --num_generations 2 \
    --beta 0.001 \
    --num_iterations 1 \
    --gradient_accumulation_steps 2 \
    --eval_steps 1 \
    --eval_limit 96 \
    --save_strategy no \
    --eval_strategy steps \
    --logging_steps 1 \
    --output_dir output/grpo_pytracify_qwen_2_5_LCB_all \
    --dataloader_num_workers 4 \
    --temperature 1.0 \
    --top_p 0.9 \
    --top_k 50 \
    --deepspeed zero3
```
