## Quick Start

**Install dependencies:**

````bash
pip install -r requirements.txt
````

---

**Make the script executable and run it:**

````bash
chmod +x problem_generation.sh
./problem_generation.sh
````

---

## Self-Play Pipeline (Code Example)

We illustrate the self-play workflow in the **code domain**, where unit tests provide verifiable reward signals.

---

**Step 1: Verifiable Reward Generation (test case construction)**  
Each instance in the input `.jsonl` file must include a `"problem"` field that specifies the coding task to be solved.  
Each run generates a new test case and appends it to the `"completions"` field. Every round in the loop below reads the previous round's output, so the specification grows progressively.

````bash
# Generate 4 rounds of test cases with different seeds
for seed in {0..3}; do
  python test_cases_generation.py \
    --seed $seed \
    --data_path code/prompts_test_cases_${seed}.jsonl \
    --output_path code/prompts_test_cases_$((seed+1)).jsonl \
    --model_path Qwen/Qwen3-32B \
    --n_gpus 4 \
    --temperature 0.6 \
    --max_len 16384 \
    --use_chat_template True
done
````
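
The record format described in Step 1 can be sketched as follows. This is a minimal illustration, not the logic of `test_cases_generation.py`: only the `"problem"` and `"completions"` field names come from the text above, while the helper `append_test_case` and the sample problem and test strings are hypothetical.

```python
import json


def append_test_case(record: dict, test_case: str) -> dict:
    """Return a copy of `record` with `test_case` appended to "completions"."""
    if "problem" not in record:
        raise ValueError('each record needs a "problem" field')
    updated = dict(record)
    # A fresh input record has no "completions" yet, so default to an empty list.
    updated["completions"] = list(record.get("completions", [])) + [test_case]
    return updated


# One line of a hypothetical input file; only "problem" is required up front.
line = json.dumps({"problem": "Write add(a, b) that returns the sum of two integers."})

record = json.loads(line)
# Two simulated rounds, each contributing one new test case.
for case in ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]:
    record = append_test_case(record, case)

print(json.dumps(record, indent=2))
```

After two simulated rounds the record carries two entries in `"completions"`, which mirrors how the seed loop grows the test suite file by file.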

Post-process the final round's output into a structured format:

````bash
python test_cases_postprocess.py \
  --input_file code/prompts_test_cases_4.jsonl \
  --output_path code/prompts_test_cases_processed.jsonl
````

---

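**Step 2: Self-Play Trajectory Collection**  
Using the processed test cases, generate diverse trajectories by sampling across multiple seeds. Each round uses a different seed and reads the file written by the previous round, starting from `code/selfplay_0.jsonl`: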

````bash
for seed in {0..7}; do
  python infer_self_play.py \
    --data_path code/selfplay_${seed}.jsonl \
    --output_path code/selfplay_$((seed+1)).jsonl \
    --model_path Qwen/Qwen3-30B-A3B-Thinking-2507 \
    --trust_remote_code True \
    --n_gpus 8 \
    --num_splits 4 \
    --num_completions 8 \
    --seed $seed \
    --temperature 1.2 \
    --max_len 81920 \
    --use_chat_template True
done
````

---

**Step 3: Reward Assignment**  
Evaluate each trajectory against the constructed test cases and assign reward signals automatically:

````bash
python self_play_eval.py \
  --data_path code/selfplay_8.jsonl \
  --output_path code/selfplay_verified.jsonl \
  --eval_type code \
  --num_workers 16
````
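
To make the idea of a test-based reward concrete, here is a minimal sketch that assumes a binary reward: 1.0 if a candidate solution passes every test case, 0.0 otherwise. The function `run_tests` and the sample solutions are hypothetical, and the actual scoring rule and sandboxing in `self_play_eval.py` may differ.

```python
import subprocess
import sys


def run_tests(solution_code: str, test_cases: list[str], timeout_s: float = 10.0) -> float:
    """Return 1.0 if the solution passes every test case, else 0.0 (assumed binary reward)."""
    # Concatenate the candidate solution and the assert-style tests into one program.
    program = solution_code + "\n\n" + "\n".join(test_cases) + "\n"
    try:
        # Run in a separate interpreter so a crash or infinite loop cannot take down the caller.
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    # A failed assert or any uncaught exception yields a non-zero exit code.
    return 1.0 if result.returncode == 0 else 0.0


tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"

rewards = [run_tests(good, tests), run_tests(bad, tests)]
print(rewards)  # prints [1.0, 0.0]
```

A separate process with a timeout is only a basic safeguard. Model-generated code should still run in an isolated environment.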

---

**Step 4: Pair Construction**  
Aggregate verified trajectories into **chosen vs. rejected** pairs for offline self-play training:

````bash
python prepare_self_play_data.py \
  --data_path code/selfplay_verified.jsonl \
  --output_path code/selfplay_training.jsonl
````
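
One plausible way to form chosen vs. rejected pairs is sketched below. It assumes each verified trajectory carries a numeric reward and pairs every passing trajectory with every failing one for the same problem. The field names `text` and `reward`, the helper `build_pairs`, and the exhaustive pairing strategy are assumptions for illustration; `prepare_self_play_data.py` may sample or filter pairs differently.

```python
from itertools import product


def build_pairs(problem: str, trajectories: list[dict]) -> list[dict]:
    """Pair every passing trajectory with every failing one for the same problem."""
    chosen = [t["text"] for t in trajectories if t["reward"] > 0]
    rejected = [t["text"] for t in trajectories if t["reward"] <= 0]
    # If all trajectories pass, or all fail, one list is empty and no pairs are produced.
    return [
        {"prompt": problem, "chosen": c, "rejected": r}
        for c, r in product(chosen, rejected)
    ]


verified = [
    {"text": "solution A", "reward": 1.0},
    {"text": "solution B", "reward": 0.0},
    {"text": "solution C", "reward": 1.0},
]

pairs = build_pairs("Write add(a, b).", verified)
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

Problems whose trajectories all pass or all fail contribute no pairs under this scheme, since there is no contrast to learn from.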

---

## SFT Pipeline (Code Example)

We illustrate the SFT workflow in the **code domain**, using teacher trajectories from GPT-OSS-120B.

---

**Step 1: Teacher Trajectory Collection**  
Sample one teacher trajectory per problem (`--num_completions 1`):

````bash
python infer_self_play.py \
  --data_path code/prompts_test_cases_processed.jsonl \
  --output_path code/prompts_trajectories.jsonl \
  --model_path openai/gpt-oss-120b \
  --trust_remote_code True \
  --n_gpus 8 \
  --num_splits 4 \
  --num_completions 1 \
  --seed 0 \
  --temperature 1.0 \
  --max_len 16384 \
  --use_chat_template True
````

---

**Step 2: Data Post-Processing**  
Filter out incomplete or invalid trajectories, then format the remaining ones as clean prompt-completion pairs for supervised fine-tuning:

````bash
python prepare_sft_data_code.py \
  --data_path code/prompts_trajectories.jsonl \
  --output_path code/sft_training.jsonl \
  --tokenizer_path Qwen/Qwen2.5-7B-Instruct
````
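
The filtering step can be sketched roughly as follows. The criteria here are assumptions chosen for illustration: a trajectory counts as incomplete if it is empty or ends inside an unclosed Markdown code fence, and a character limit stands in for whatever token-length check the tokenizer is used for. The field names `problem` and `completion` and the helper `to_sft_examples` are hypothetical, and `prepare_sft_data_code.py` may apply different rules.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def to_sft_examples(records: list[dict], max_chars: int = 20000) -> list[dict]:
    """Drop incomplete trajectories and return prompt-completion pairs."""
    examples = []
    for rec in records:
        completion = (rec.get("completion") or "").strip()
        # An empty output or an odd number of fence markers suggests a truncated answer.
        if not completion or completion.count(FENCE) % 2 != 0:
            continue
        # Character count is a crude stand-in for a tokenizer-based length check.
        if len(completion) > max_chars:
            continue
        examples.append({"prompt": rec["problem"], "completion": completion})
    return examples


records = [
    {"problem": "P1", "completion": "Solution:\n" + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE},
    {"problem": "P2", "completion": ""},
    {"problem": "P3", "completion": "Let me think.\n" + FENCE + "python\ndef f("},
]

sft_examples = to_sft_examples(records)
print([ex["prompt"] for ex in sft_examples])  # prints ['P1']
```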
