<!-- # 🐦 Magpie -->

## Installation

**Build environment**
```
cd magpie
conda create -n magpie python=3.10 -y
conda activate magpie
pip install -r requirements.txt
```

**Get access to Llama-3 models from 🤗 Huggingface**

You can apply for Llama-3 model access [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). To login in the terminal, enter:
```
huggingface-cli login
```
then enter your Huggingface private key beginning with "hf_".

## Toy Example

**Play with Jupyter Notebook**

The toy example can be found in [`demo.ipynb`](demo.ipynb). Have fun! 

## Batched Data Generation
We use Llama-3-8B-Instruct as an example to demonstrate the batched data generation process. To run batched generation, you can simply run:
```
cd scripts
bash magpie.sh
```
The script will generate both instructions and responses in the data folder. It has been tested on an RTX 4090 24G GPU. If you are using GPUs with less memory, consider implementing [quantization](https://docs.vllm.ai/en/latest/quantization/fp8.html).

We also provide scripts for other models in the [`scripts`](scripts) folder. Note that for model sizes greater than 8B, you may need 4*A100 GPUs to run the scripts.

### Batched Multi-turn Data Generation \[Optional\]
After generating instruction-response pairs, you can extend them to multi-turn conversations. To do so, simply run the following command:
```
bash magpie-multi-turn.sh ***_ins_res.json
```
where `***_ins_res.json` is the single-turn instruction-response pairs generated in the previous step.

## Dataset Filtering
### 1. Tagging
To tag the generated instruction-response pairs, you can run:
```
cd scripts
bash unitag.sh ***_ins_res.json all
```
This script will automatically generate quality, difficulty, task category, safety, reward, and language for the generated dataset. You can also generate one tag at a time. For example, if you just want to generate the safety label using device 0, you can run:
```
cd scripts
bash unitag.sh ***_ins_res.json safety 0
```
### 2. Data Concatenation and Converting
You may generate datasets with different generation configurations. We provide a Jupyter notebook [here](data/data_concatenation.ipynb) for concatenating all datasets and converting them to ShareGPT format, which is fully supported by [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) for fine-tuning.

### 3. Removing Repetition
Once you have a full dataset converted to ShareGPT format, you can calculate the minimum neighbor distance of each instruction and remove repetitions. To do so, run:
```
cd exp
python gen_dis.py --input_file ***_sharegpt.jsonl
```
where `***_sharegpt.jsonl` is the dataset path obtained in the previous step. The Python script will take care of building the FAISS index and calculating the minimum distance. 

### 4. Design and Apply Your Filter
We provide a Jupyter notebook [here](data/data_filter.ipynb) for simple filtering. You can adjust the filtering parameters to design and apply your own filter based on your needs.