


# MASTER: Instruction Tuning Framework Based on Multi-Scenario Data Augmentation

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-green.svg)](https://www.python.org/)
[![Framework-LLaMA-Factory](https://img.shields.io/badge/Framework-LLaMA__Factory-orange.svg)](https://github.com/hiyouga/LLaMA-Factory)

## Project Overview
We construct a high-quality instruction-tuning dataset through collaborative augmentation of three educational scenarios: **Error Correction Generation**, **Multi-Round Debate**, and **Analogy-based Expansion**. Experiments show that the MASTER-LLaMA model achieves:
- A 13.84% accuracy improvement on the MMLU-PRO-MATH test set
- An 11.59% increase in Pass@1 for HumanEval code generation tasks

## Directory Structure
```bash
├── agentclass/                # Data augmentation scenarios
│   ├── make_error/            # Error correction generation
│   │   ├── math/              # Math data augmentation
│   │   │   ├── sbatch/        # Execution scripts
│   │   │   └── output/        # Augmented results
│   │   ├── code/              # Programming data augmentation
│   │   ├── openhermes/        # General-purpose data augmentation
│   ├── debate/                # Debate scenario
│   └── expand/                # Analogy/variation generation
├── LoRA/                      # Model fine-tuning
│   ├── MASTER_llama3-8b/      # Training scripts and data        
│   ├── MASTER_mistral-7b/         
│   └── ....../               
├── eval/                      # Evaluation module
│   ├── MATH_EVAL/             # Math ability evaluation
│   ├── CODE_EVAL/             # Code ability evaluation
│   └── GENERAL_EVAL/          # General task evaluation
├── quantize_model/            # Model storage directory
├── Embedding_model/           # Cosine similarity embedding model directory
├── data/                      # Source of augmented data and benchmarks
└── final_model/               # Final trained models


```

## Quick Start

### 1. Data Augmentatioon
```bash
# Download required models
wget -P quantize_model/ https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
wget -P quantize_model/ https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
wget -P quantize_model/ https://huggingface.co/Qwen/Qwen2.5-14B-Instruct ...
# Example: Math data error correction
cd agentclass/make_error/math/sbatch
sbatch first_turn.sh 
sbatch second_turn.sh
sbatch final_turn.sh
```

For example, to augment math Q&A data using the make_error scenario, run the first_turn.sh, second_turn.sh, and final_turn.sh scripts in the agentclass/make_error/math/sbatch folder. Then locate the final_turn.json file in the output folder, which contains all rounds. Extract question, student error, and teacher correction from the Instruct field, and the corrected student answer from student_correct. Combine them into ShareGPT-format 4-turn data.

We randomly select 10,000 out of the 30,000 augmented math/code examples and merge them with 9,000 general-purpose samples to form the final BOOST-QA training dataset in ShareGPT format. You may vary the quantity or ratio of augmented data, or apply our approach to other datasets.

### 2. LoRA Fine-Tuning

We conduct the following experiments:

Fine-tune llama3-8B-base, mistral-7B-base, and qwen2.5-7B-base using both original and MASTER-augmented data (main experiment).

Compare against baselines such as TAGCOS, CoT-fine, RandomAug, and SpellingAug on llama3-8B-base.

Ablation study: fine-tune llama3-8B-base using data from single or paired augmentation scenarios.

We use the LLaMA-Factory framework. Each model subfolder under LoRA/ includes its config YAML, training scripts, and datasets.

### 3. Model Evaluation

During evaluation, tasks are categorized into three sections under the eval directory: MATH_EVAL, GENERAL_EVAL, and CODE_EVAL. Each section contains multiple benchmark-specific evaluation projects. For example, to evaluate the performance of the llama3-8B-base model trained on the BOOST-QA dataset on the MMLU-PRO-MATH benchmark, you simply run the llama3_eval.sh script under eval/MATH_EVAL/MMLU_PRO/llama3_MA_Gen_data to perform inference, and then execute llama3_match.sh to extract answers and calculate accuracy.

Specifically, for subjective benchmarks such as MATH, after inference, you must run the judge.sh script to let a judgment model determine the correctness of the answers, followed by match.sh to compute the final accuracy. For programming tasks like HumanEval, after inference, the data_clean.py script is used to extract executable code strings, and then the judge.sh script evaluates the results by running the code.