# Motivation

Efficiently adjusting the alignment strength of language models without incurring the high cost of full retraining is an increasingly important challenge.  This assignment focuses on a training-efficient method for realigning a base model with its aligned counterpart.




# Task

You are provided with a reference policy $\pi^{\text{ref}}(y \mid x)$, represented by DeepSeek-R1-Distilled-Qwen-1.5B, and an already aligned model $\pi_\theta(\beta)(y \mid x)$, represented by DeepScaleR-Preview-1.5B. The aligned model is obtained by further training the reference model and demonstrates enhanced efficiency-oriented reasoning capabilities.

Their performance on AIME24 is as follows:

| Models                                 | AIME24 Avg@32 | AIME24 #Token | Token Reduction % |
|----------------------------------------|---------------|---------------|-------------------|
| DeepSeek-R1-Distill-Qwen-1.5B          | 18.33         | 12415         | --                | 
| DeepScaleR-1.5B-Preview                | 26.77         | 8533          | 31.27             | 
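
The "Token Reduction %" column is consistent with the relative drop in average token count compared with the reference model. The short sketch below assumes that definition (the source does not state the formula explicitly) and reproduces the tabulated value; the function name `token_reduction_pct` is illustrative only.

```python
def token_reduction_pct(baseline_tokens: float, model_tokens: float) -> float:
    """Relative reduction in average tokens versus the baseline, in percent.

    Assumed definition: 100 * (baseline - model) / baseline.
    """
    return 100.0 * (baseline_tokens - model_tokens) / baseline_tokens


# Values taken from the table above.
reduction = round(token_reduction_pct(12415, 8533), 2)
print(reduction)  # 31.27
```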


Your task is to implement and evaluate a method for efficient realignment between these two models. The goal is a training-efficient approach that can further adjust the alignment strength and ultimately achieve better efficiency-oriented reasoning than the reference model. Concretely, you should design an algorithm that adjusts the alignment strength of the model.

You need to refactor the LLaMA-Factory repository to integrate your proposed method.

## Implementation Guidelines

You are required to design and implement the **DualAlign** algorithm for efficient model realignment. You have access to:

1. **Reference Model:** The base model that needs alignment adjustment (DeepSeek-R1-Distilled-Qwen-1.5B)
2. **Aligned Model:** A model that has already been aligned and shows improved performance (DeepScaleR-Preview-1.5B)  
3. **Target:** Train a new model that can achieve better efficiency-oriented reasoning

## Implementation Requirements

1. **Algorithm Design:** Create an innovative training method that leverages both the reference and aligned models to improve alignment strength.

2. **Framework Integration:** Implement your method in the LLaMA-Factory framework by creating a new training stage called `dualalign`.

3. **Training Configuration:** Use the provided configuration in `/workspace/task/scripts/train.yaml` with your custom implementation.

**Note:**

You should design a novel approach to utilize information from both models during training. Consider how to effectively combine their knowledge without simply copying existing methods. The goal is to develop an algorithm that can flexibly control alignment strength and achieve superior performance.
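
To make "alignment strength" concrete, the sketch below shows one simple and well-known way to treat it as a tunable knob: interpolating (or extrapolating) between the two models' next-token distributions in log-probability space. This is only an intuition-building baseline under stated assumptions, not the DualAlign algorithm, which the task leaves for you to design. The names `realigned_log_probs` and `lam` are hypothetical, and plain Python lists stand in for the logit tensors a real implementation would use.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def realigned_log_probs(ref_logits, aligned_logits, lam):
    """Combine two next-token distributions with strength `lam`.

    The result is proportional to pi_ref^(1 - lam) * pi_aligned^lam:
    lam = 0 recovers the reference model, lam = 1 the aligned model,
    and lam > 1 extrapolates beyond the aligned model.
    """
    ref_lp = log_softmax(ref_logits)
    ali_lp = log_softmax(aligned_logits)
    mixed = [r + lam * (a - r) for r, a in zip(ref_lp, ali_lp)]
    return log_softmax(mixed)  # renormalize so probabilities sum to 1


# Toy vocabulary of three tokens; the logits are made up for illustration.
ref = [2.0, 0.0, -1.0]
aligned = [0.0, 2.0, -1.0]
for lam in (0.0, 1.0, 1.5):
    probs = [round(math.exp(v), 3) for v in realigned_log_probs(ref, aligned, lam)]
    print(lam, probs)
```

A training method could, for example, use such a combined distribution as a target or as part of a loss, but how the two models are actually combined is the design decision this task asks you to make.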

You should work under the `/workspace/task` and `/workspace/data` directories.

The directory `/workspace/task/repositories/LLaMA-Factory` contains multiple `README.md` files. You are encouraged to read them to better understand the training framework.

You should output the following files: 

- `/workspace/data/outputs/result.parquet`: The inference result produced by your trained model. This file should contain:
  - `output` column: String responses from your trained model
  - Same order and number of rows as the test dataset
  - Proper pandas DataFrame format
  - The full thinking process and the final answer in the `output` column
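
The requirements above can be checked before writing the file. The following is a minimal sketch, assuming the responses are already collected as a list of strings in test-set order; the helper names `build_result_frame` and `save_results` are hypothetical, and writing Parquet with pandas additionally requires a Parquet engine such as `pyarrow`.

```python
import pandas as pd


def build_result_frame(test_df: pd.DataFrame, responses: list) -> pd.DataFrame:
    """Build the result frame and verify the row count and value types."""
    if len(responses) != len(test_df):
        raise ValueError(
            f"expected {len(test_df)} responses, got {len(responses)}"
        )
    if not all(isinstance(r, str) for r in responses):
        raise TypeError("every response must be a string")
    # One row per test example, in the same order as the test dataset.
    return pd.DataFrame({"output": list(responses)})


def save_results(test_path: str, responses: list, out_path: str) -> pd.DataFrame:
    """Read the test set, validate the responses, and write the Parquet file."""
    test_df = pd.read_parquet(test_path)
    result = build_result_frame(test_df, responses)
    result.to_parquet(out_path, index=False)
    return result
```

Each response string should hold the complete generation, that is, the reasoning followed by the final answer, rather than only the extracted answer.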


# Data

Training Data: `/workspace/data/datasets/long_cot_calibration.json`

Test Data: `/workspace/data/datasets/aime-2024.parquet`


## Constraints

- Training: Maximum 400 training steps with batch size 16
- Context Length: Training on 4k-8k context, evaluation on up to 16k context  
- Hardware: 8 x 80G GPUs
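
As a quick sanity check on the training budget, the sketch below assumes that "batch size 16" means the global effective batch size under standard data parallelism, which the source does not spell out. The per-device batch size and gradient accumulation values are one possible split chosen for illustration; they correspond to the Hugging Face `per_device_train_batch_size` and `gradient_accumulation_steps` training arguments.

```python
def global_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Effective number of sequences consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


MAX_STEPS = 400            # from the constraints above
TARGET_GLOBAL_BATCH = 16   # assumed to be the global batch size
NUM_GPUS = 8

# Assumed split: 1 sequence per GPU, 2 accumulation steps.
per_device, accum = 1, 2
assert global_batch_size(per_device, accum, NUM_GPUS) == TARGET_GLOBAL_BATCH

# Upper bound on training sequences seen within the step budget.
max_samples_seen = MAX_STEPS * TARGET_GLOBAL_BATCH
print(max_samples_seen)  # 6400
```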


## Evaluation

We will evaluate accuracy and efficiency and return a score based on these metrics.

## Environment

We have set up a conda environment for you at `/workspace/conds` and activated it. The packages required for LLaMA-Factory and vLLM are installed in this environment.

File structure:
```
/workspace/
├── data/
|   ├── checkpoints/                            # your trained model  
|   |   ├──  DeepSeek-R1-Distilled-Qwen-1.5B
|   |   └──  DeepScaleR-Preview-1.5B
|   ├── dataset/
|   |   ├── aime-2024.parquet                   # do not modify this file
|   |   └── long_cot_calibration.jsonl           # do not modify this file
|   └── output/
└── task/
    ├── repositories/                          
    |   └── LLaMA-Factory                      # you can modify the src/llamafactory/train/dpo/trainer.py and other related files
    ├── scripts/      
    |   ├── llm.py                         # do not modify this file
    |   ├── train.yaml                     
    |   ├── train.sh                     
    |   └── eval_aime24.py                 # do not modify this file      
    └── task_description.md
```


## Scripts

Evaluation: Execute the following commands to evaluate your trained model and obtain the results.

```bash 
cd /workspace/task/scripts
serve run llm:build_app model=your_model_path/model_name tensor-parallel-size=1

# open another terminal
python /workspace/task/scripts/eval_aime24.py --temperature 0.7 --top_p 0.95 --max_tokens 16384 --model model_name --test_file /workspace/data/datasets/aime-2024.parquet
```

**Scripts**

`/workspace/task/scripts/eval_aime24.py`: This is the evaluation script. Use the `--model model_name` argument to specify the model for inference. Note that `model_name` refers to the model's name only, without the file path.

`/workspace/task/scripts/llm.py`: This is the vLLM-based parallel inference engine. For example, if you have 4 GPUs and set `--tensor-parallel-size=1`, each GPU will hold a full replica of the model and perform inference in parallel.

`/workspace/task/scripts/train.yaml`: This is the training configuration file. You can modify the file to fit your needs.

`/workspace/task/scripts/train.sh`: This is the training script. You can modify the script to fit your needs.