# MoAA

This README provides instructions for running the MoAA process, which includes the Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) stages.

## Requirements
- Python 3.10+
- PyTorch 2.0+
- CUDA 11.8+
- vLLM
- HuggingFace
- OpenAI
- Transformers
- Datasets
- trl
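
Before launching any of the scripts, it can help to confirm that the Python packages above are importable. The following is a minimal sketch, not part of the repo: the import names in `REQUIRED` are assumptions based on the usual PyPI packages for the libraries listed, and the helper `find_missing` is hypothetical.

```python
import importlib.util

# Assumed import names for the libraries listed above (not taken from the repo).
REQUIRED = ["torch", "vllm", "openai", "transformers", "datasets", "trl"]


def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

This only checks that each module can be found; it does not verify the PyTorch or CUDA versions.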

## MoAA SFT (Supervised Fine-Tuning)

To run the SFT stage for MoAA, use the following commands:

```
# to run the SFT stage for LLaMA 3.1 8B
bash run_SFT_llama.sh
```

```
# to run the SFT stage for GEMMA 2 9B
bash run_SFT_gemma.sh
```

We have already generated the SFT dataset for this stage using the MoA approach, so you can run the scripts above directly. The data can be found in `data/SFT/MoA_r1_4modelsWGQL` and `data/SFT/UC5k_MoA_r1_4modelsWGQL`.

#### Generate the SFT dataset yourself
You can also generate the SFT dataset yourself using `generate_for_Ultrachat.py` and `generate_for_UC5k.py` in `MoA/Stage 1`. To generate the data for Ultrafeedback, run:

```
cd "MoA/Stage 1"
bash run_generate_Ultrafeedback_data.sh
```

Ultrachat is multi-turn, so generating its data takes two steps. First, generate the data for each proposer:
```
bash run_generate_ultrachat_r0.sh
```
Then put the paths of the generated data in the `run_generate_ultrachat_r1.sh` file and run the following to generate the Ultrachat data:
```
cd "MoA/Stage 1"
bash run_generate_ultrachat_r1.sh
```
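
To illustrate how the second round depends on the first, here is a rough sketch of grouping per-proposer outputs by prompt and formatting them as input for a second-round model. It is not the repo's implementation: the record schema (`prompt`, `response`), the function names, and the prompt wording are all hypothetical, and the actual scripts read the round-0 results from the file paths you configure.

```python
from collections import defaultdict


def collect_proposals(round0_outputs):
    """Group round-0 responses by prompt.

    round0_outputs maps a proposer name to a list of
    {"prompt": str, "response": str} records (hypothetical schema).
    """
    grouped = defaultdict(list)
    for proposer, records in round0_outputs.items():
        for rec in records:
            grouped[rec["prompt"]].append((proposer, rec["response"]))
    return dict(grouped)


def build_round1_input(prompt, proposals):
    """Format one prompt and its proposals for a second-round model (illustrative wording)."""
    lines = [f"Question: {prompt}", "Candidate answers:"]
    for i, (_, response) in enumerate(proposals, start=1):
        lines.append(f"{i}. {response}")
    return "\n".join(lines)


outputs = {
    "model_a": [{"prompt": "Hi", "response": "Hello"}],
    "model_b": [{"prompt": "Hi", "response": "Hey"}],
}
grouped = collect_proposals(outputs)
print(build_round1_input("Hi", grouped["Hi"]))
```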

## MoAA DPO (Direct Preference Optimization)

To run the DPO stage for MoAA, use the following commands:

```
# to run the DPO stage for LLaMA 3.1 8B
bash run_DPO_llama.sh
```

```
# to run the DPO stage for GEMMA 2 9B
bash run_DPO_gemma.sh
```

Before running the DPO stage, you must first run the SFT stage, because the SFT model is not included in the repo due to its large size. After SFT finishes, generate the DPO dataset yourself. First, generate candidate responses using the code in `MoA/Stage 2`:

```
cd "MoA/Stage 2"
bash generate_preference_data.sh
```
Then use a reward model to generate the preference data. This code base uses the [ArmoRM](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) reward model. Support for using our MoA as a reward model will be added after further review and cleanup. After you specify the paths of the candidate responses in the bash script, run the following command to generate the preference data with the reward model:

```
bash generate_preference_data_RM.sh
```
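
As a rough illustration of how reward scores can be turned into preference data, the sketch below picks the highest-scored candidate as `chosen` and the lowest-scored as `rejected`. This selection rule is an assumption, not necessarily what `generate_preference_data_RM.sh` does, and the function `build_preference_pair` is hypothetical. The `prompt`/`chosen`/`rejected` layout matches the column names commonly used with trl's DPO trainer.

```python
def build_preference_pair(prompt, candidates, scores):
    """Pick the best- and worst-scored candidates as chosen/rejected (assumed rule)."""
    if len(candidates) != len(scores) or len(candidates) < 2:
        raise ValueError("need at least two candidates, with one score each")
    # Sort by score only, so ties keep their original candidate order.
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0])
    rejected, chosen = ranked[0][1], ranked[-1][1]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


pair = build_preference_pair("q", ["a", "b", "c"], [0.2, 0.9, 0.1])
print(pair)
```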

After you get the preference data, you can run the DPO stage shown above to start the training. 
