# Align and Adapt: Enhancing LLM Format Alignment and Knowledge Adaptation via Reverse Constraints Generation

Our framework upgrades an existing long-form QA dataset into an instruction-tuning dataset. The resulting dataset can be used to align a model's format-following capability and to adapt the model to new domain-specific knowledge.

## System Requirements
- 4 * RTX 3090
- Ubuntu 22.04

## Environment
- Python = 3.10.12
- Miniconda as virtual environment

To install all the required packages, use the command below.
```
pip install -r requirement_file/deepspeed_requirements.txt
```

## The project folder should be organized as follows
```
├── configs
│   ├── config.yml
│   └── finetune
│       └── llama.yml
├── dataset
│   └── natural_question
│       ├── natural_question.jsonl
│       ├── natural_question_01.jsonl
│       ├── natural_question_02.jsonl
│       ├── natural_question_03.jsonl
│       ├── natural_question_04.jsonl
│       └── natural_question_finetune_llama.jsonl
├── evaluations
│   ├── google-research
│   ├── LiveBench
│   └── Multi-IF
├── models
│   ├── Llama-2-7b-chat-hf
│   ├── llama_finetuned_model
│   └── Qwen3-32B
├── module_01_preprocess
├── module_02_feature
├── module_03_constraint
├── module_04_upgrade
├── module_05_finetune
├── module_06_evaluation
├── reader
├── readme.md
├── requirement_file
│   ├── deepspeed_requirements.txt
│   ├── livebench_environments.txt
│   └── multi_if_requirements.txt
└── utils
```
## Training Dataset Generation
Our dataset generation pipeline is separated into 4 modules. To use it, please download:
- Qwen3-32B model
- Llama-2-7b-chat model
- SpaCy en_core_web_sm model
- NLTK punkt_tab

To download the SpaCy en_core_web_sm model and NLTK punkt_tab, use the commands below.
```
python -m spacy download en_core_web_sm
python utils/download_nltk.py
```

Please adjust the qwen_32b_model configurations in the configs/config.yml file.

| Configurations | Value |
|---|---|
| base_model | Path to the Qwen3-32B model (qwen model) |
| world_size | Number of GPUs used in parallel |
| max_new_tokens | Maximum number of new tokens to generate; adjust it to the answer length in your long-form QA dataset (default: 2048) |

### Module 1: Preprocess
The preprocess module rewrites the original questions to increase the diversity of the dataset. The pipeline can preprocess any long-form QA dataset in the correct format. Before generating the first stage of the dataset, make sure your input QA dataset is a .jsonl file in which each line is a JSON object with a "question" field and an "answer" field, as shown below.
```
{"question": "What is the capital city of France", "answer": "Paris is the capital city of France."}
{"question": "Who discovered gravity?", "answer": "Isaac Newton discovered the existence of gravity."}
```
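Because a single malformed line (for example, a missing closing quote) can break the preprocessing step, it may help to check the input file first. The following is a minimal sketch, not part of the repository: the function names `validate_jsonl_lines` and `validate_jsonl_file` are hypothetical, and the only assumption taken from this README is that every record needs non-empty string `question` and `answer` fields.

```python
import json

# Fields required by the input format described above.
REQUIRED_KEYS = ("question", "answer")


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{key}'"))
    return problems


def validate_jsonl_file(path):
    """Hypothetical helper: validate a .jsonl file on disk."""
    with open(path, encoding="utf-8") as handle:
        return validate_jsonl_lines(handle)
```

For example, `validate_jsonl_file("dataset/natural_question/natural_question.jsonl")` returns an empty list when every line is well formed.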
To run the dataset preprocessing, use the command below.
```
python module_01_preprocess/question_rewrite.py --jsonl_input dataset/natural_question/natural_question.jsonl --jsonl_output dataset/natural_question/natural_question_01.jsonl
```
### Module 2: Feature Extraction
The feature extraction module extracts semantic and text-structure features from the dataset.

To run feature extraction, use the dataset generated by Module 1 as the input dataset.
```
python module_02_feature/dataset_annotation.py --jsonl_input dataset/natural_question/natural_question_01.jsonl --jsonl_output dataset/natural_question/natural_question_02.jsonl
```

### Module 3: Constraints Generation
The constraints generation module randomly assigns suitable format constraints to each question based on its semantic and structural features.

To run constraints generation, use the dataset generated by Module 2 as the input dataset.
```
python module_03_constraint/constraints_selection.py --jsonl_input dataset/natural_question/natural_question_02.jsonl --jsonl_output dataset/natural_question/natural_question_03.jsonl
```

### Module 4: Dataset Upgrade
The dataset upgrade module modifies the QA dataset based on the constraints and features generated in the previous stages. It converts an existing long-form QA dataset into an instruction-following dataset focused on format alignment.

To run the dataset upgrade, use the dataset generated by Module 3 as the input dataset.
```
python module_04_upgrade/dataset_rewrite.py --jsonl_input dataset/natural_question/natural_question_03.jsonl --jsonl_output dataset/natural_question/natural_question_04.jsonl
```
After the dataset upgrade is complete, you can create a training dataset for different open-source models. The script "module_04_upgrade/create_custom_dataset.py" creates a multi-turn chat fine-tuning dataset and applies the chat template of the model given by --model_path.
```
python module_04_upgrade/create_custom_dataset.py --jsonl_input dataset/natural_question/natural_question_04.jsonl --jsonl_output dataset/natural_question/natural_question_finetune_llama.jsonl --model_path models/Llama-2-7b-chat-hf
```
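The four generation stages above form a chain in which each module reads the file written by the previous one. As a rough sketch (not a script shipped with this repository), the chain could be driven from Python as follows. The script paths and the `--jsonl_input`/`--jsonl_output` flags are taken from the commands above; the names `STAGES`, `build_commands`, and `run_pipeline` are hypothetical.

```python
import subprocess
import sys

# (script, output suffix) for Modules 1 to 4, as listed in this README.
STAGES = [
    ("module_01_preprocess/question_rewrite.py", "01"),
    ("module_02_feature/dataset_annotation.py", "02"),
    ("module_03_constraint/constraints_selection.py", "03"),
    ("module_04_upgrade/dataset_rewrite.py", "04"),
]


def build_commands(dataset_dir, name):
    """Build one command per stage; each stage consumes the previous stage's output."""
    commands = []
    current = f"{dataset_dir}/{name}.jsonl"
    for script, suffix in STAGES:
        output = f"{dataset_dir}/{name}_{suffix}.jsonl"
        commands.append(
            [sys.executable, script, "--jsonl_input", current, "--jsonl_output", output]
        )
        current = output
    return commands


def run_pipeline(dataset_dir="dataset/natural_question", name="natural_question"):
    """Run all stages in order from the repository root, stopping at the first failure."""
    for command in build_commands(dataset_dir, name):
        subprocess.run(command, check=True)


# Dry run: print the commands without executing them.
for command in build_commands("dataset/natural_question", "natural_question"):
    print(" ".join(command))
```

Calling `run_pipeline()` from the repository root would execute the stages one after another. The final `create_custom_dataset.py` step is left out here because it takes an additional `--model_path` argument.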

## Model Fine-tuning
The fine-tuning configurations can be found in "configs/finetune/mistral.yml" and "configs/finetune/llama.yml". Before you start training the model, please modify the configuration files.

| Configurations | Value |
|---|---|
| base_model | Path to the base model |
| new_model | Path where the last checkpoint is saved |
| output_dir | Path where all checkpoints are saved |
| visible_devices | GPU ID(s) used for training |
| Path_to_dataset | Path to the instruction-tuning dataset |
| project name | Wandb project name |
| run_name | Wandb run name |

You can also adjust other training hyperparameters in the configuration files. We use wandb to record our experiments.

To use multi-GPU training, please define your own DeepSpeed configuration. To start model training on a single GPU:
```
python module_05_finetune/finetune_multi_turn.py --config configs/finetune/llama.yml
```

After training, merge the LoRA adapter into the base model with the command below.
```
python module_05_finetune/merge_lora_model.py --base_model_path models/Llama-2-7b-chat-hf --adapter_path path/to/adapter --new_model_path models/llama_finetuned_model
```

## Benchmarks
We use three widely used instruction-following benchmarks to evaluate the effectiveness of our method, together with ROUGE-Score to measure how well the original content is preserved.
1. Instruction-Following Evaluation for Large Language Models (IFEval)
2. Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following
3. LiveBench: A Challenging, Contamination-Limited LLM Benchmark
4. ROUGE-Score

To use these benchmarks, set the following fields in the configs/config.yml file, then clone the benchmark repositories into the evaluations folder and install their required packages.

| Configurations | Value |
|---|---|
| base_model | Path to the fine-tuned model (llama/mistral) |
| world_size | Number of GPUs used for the benchmark |

```
cd evaluations
git clone https://github.com/google-research/google-research.git
git clone https://github.com/facebookresearch/Multi-IF.git
git clone https://github.com/LiveBench/LiveBench.git
```

To install the required packages, please create a separate virtual environment for each benchmark, using python=3.10.12 for all of them. IFEval and ROUGE-Score use the same environment as the one defined in requirement_file/deepspeed_requirements.txt.

LiveBench environment:
```
cd evaluations/LiveBench
pip install -e .
cd livebench
python download_questions.py
```

Multi-IF environment:
```
cd evaluations/Multi-IF
pip install -r requirements.txt
pip install vllm
git clone https://huggingface.co/datasets/facebook/Multi-IF data/Multi-IF
```

### IFEval
To generate the model responses for the IFEval benchmark, use the command below with the following arguments.

| Configurations | Value |
|---|---|
| --model_name | Select either llama_model or mistral_model |
| --output_path | Change the output filename as needed, but keep the output directory unchanged |
```
python module_06_evaluation/ifeval_inference_vllm.py --model_name llama_model --output_path evaluations/google-research/instruction_following_eval/data/llama_finetuned_model.jsonl
```
To show the result of the IFEval benchmark, use the commands below. Set input_response_data to the filename of the fine-tuned model's output.
```
cd evaluations/google-research

python3 -m instruction_following_eval.evaluation_main   --input_data=./instruction_following_eval/data/input_data.jsonl   --input_response_data=./instruction_following_eval/data/llama_finetuned_model.jsonl   --output_dir=./instruction_following_eval/data/
```
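Before scoring, it can be useful to confirm that the response file covers every benchmark prompt. The sketch below assumes that both files are JSONL, that each record in input_data.jsonl has a `prompt` field, and that each record in the response file has `prompt` and `response` fields; this matches our understanding of the IFEval evaluation script but should be verified against the cloned repository. The helper names `load_jsonl` and `missing_prompts` are hypothetical.

```python
import json


def load_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


def missing_prompts(input_records, response_records):
    """Return the benchmark prompts that have no non-empty response."""
    answered = {
        record["prompt"]
        for record in response_records
        # Treat a missing, null, or whitespace-only response as unanswered.
        if str(record.get("response") or "").strip()
    }
    return [record["prompt"] for record in input_records if record["prompt"] not in answered]
```

A typical use is to open both files, pass them through `load_jsonl`, and check that `missing_prompts(...)` returns an empty list before running the evaluation.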
### Multi-IF
To generate the result of the Multi-IF benchmark, use the commands below. Adjust the batch size and tensor parallel size to your GPU configuration, and replace path/to/finetuned/model in both model_path and tokenizer_path with the path to the fine-tuned model.
```
cd evaluations/Multi-IF

# Skip this clone if the dataset was already downloaded during the environment setup.
git clone https://huggingface.co/datasets/facebook/Multi-IF data/Multi-IF

python multi_turn_instruct_following_eval_vllm.py \
        --model_path path/to/finetuned/model \
        --tokenizer_path  path/to/finetuned/model \
        --input_data_csv data/Multi-IF/multiIF_20241018.csv \
        --batch_size 250 \
        --tensor_parallel_size 4
```
### LiveBench
Please modify the commands below based on the following details.

| Configurations | Value |
|---|---|
| /path/to/finetuned/model | Absolute path to the fine-tuned model (used both as the model and as the --tokenizer argument) |
| /path/to/chat/template.jinja | Absolute path to the model's official prompt template (.jinja format) |
| finetuned_model_name | Name of the fine-tuned model to be shown in the benchmark |
| --tensor-parallel-size | Adjust based on the number of available GPUs |
| --parallel-requests | Adjust based on the available VRAM |

Host a local vLLM server.
```
vllm serve /path/to/finetuned/model --served-model-name finetuned_model_name --host 0.0.0.0 --port 8000 --dtype bfloat16 --api-key local-secret --chat-template /path/to/chat/template.jinja --tokenizer /path/to/finetuned/model --tensor-parallel-size 4
```
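Once the server is running, a quick way to confirm that it is reachable and that the served model name is registered is to query the OpenAI-compatible `/v1/models` endpoint. The following is a minimal sketch using only the Python standard library; the host, port, and API key mirror the `vllm serve` command above, and the function names are hypothetical.

```python
import json
import urllib.request


def build_models_request(api_base="http://localhost:8000/v1", api_key="local-secret"):
    """Build a GET request for the model list, authenticated with the server's API key."""
    return urllib.request.Request(
        f"{api_base.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def served_model_names(payload):
    """Extract model IDs from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(api_base="http://localhost:8000/v1", api_key="local-secret", timeout=10):
    """Query the running server and return the names it serves."""
    request = build_models_request(api_base, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return served_model_names(json.load(response))
```

If the server started correctly, `list_served_models()` should include the value passed to `--served-model-name`.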

Start generating the results. Please adjust --parallel-requests based on your GPU configuration.
```
cd evaluations/LiveBench/livebench

python run_livebench.py --model /path/to/finetuned/model --bench-name live_bench/instruction_following --livebench-release-option 2024-11-25 --max-tokens 4096 --api-key local-secret --api-base http://localhost:8000/v1 --parallel-requests 50 
```

To show the result of LiveBench, use the command below.
```
python show_livebench_result.py --bench-name live_bench/instruction_following --model-list finetuned_model_name
```

### ROUGE-Score
ROUGE-Score is used to evaluate whether our framework retains the original meaning of the long-form QA dataset.

To start the benchmark, use the command below.
```
python module_06_evaluation/rouge_evaluation.py --jsonl_input dataset/natural_question/natural_question_04.jsonl
```
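To give an intuition for what this score measures, the sketch below computes a simplified ROUGE-L F1 between an original answer and a rewritten one, based on the longest common subsequence of whitespace-separated, lowercased tokens. It is an illustration only: which ROUGE variant, tokenizer, and library `rouge_evaluation.py` actually uses is not stated in this README, and the function names here are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 on lowercased, whitespace-split tokens."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # share of the candidate covered by the LCS
    recall = lcs / len(ref_tokens)      # share of the reference covered by the LCS
    return 2 * precision * recall / (precision + recall)
```

A score close to 1 indicates that the rewritten answer preserves most of the original wording in the same order, while a low score suggests that content was dropped or heavily rephrased.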