# DebateGPT: Fine-tuning Large Language Models with Multi-agent Debate Supervision

For all experiments you must set your OpenAI key. You can do this using the export command below 
	`export OPENAI_API_KEY = <key>`
or set this manually in each file
	`openai.api_key = '' #YOUR KEY HERE`

Install requirements using
	`pip install -r requirements.txt`

## Generating data

First download the Alpaca dataset from [here](https://github.com/tatsu-lab/stanford_alpaca).

To generate new data, run from the outermost directory
	`python generate_data.py --dataset_size [DATASET_SIZE]`

Then format the data into jsonl files using the same file:
	`python generate_data.py --reuse`

To fine-tune GPT-3.5 and get DebateGPT-3.5
	`python finetuning.py --epochs [NUM_EPOCHS]`
When you fine-tune, save the model string from the fine-tuning API and add this to all evaluation files.


## Running experiments

Once you have a fine-tuned model, you must take the model string and insert this into each file under the `generate_answer` function.

The code for running arithmetic, MMLU, ARC, and Winogrande, and AlpacaEval tasks may be found in the following subfolders

* ./arithmetic/ contains code for running arithmetic
* ./mmlu/ contains code for running mmlu results.
* ./arc/ contains code for running arc results
* ./winogrande/ contains code for running winogrande results
* ./alpaca-eval/ contains code for running alpaca-eval results

**Math:**

To generate and evaluated answer for Math problems through multiagent debate, cd into the math directory and run:
	`python gen_math.py`
	
	
**MMLU:**

To generate answers for MMLU, cd into the MMLU directory and run:
	`python gen_mmlu.py --agents 1 --rounds 1 --model ft-gpt3.5`

To evaluate the generated results of MMLU:
	`python eval_mmlu.py --agents 1 --rounds 1 --model ft-gpt3.5`
	
You can download the MMLU dataset [here](https://github.com/hendrycks/test)

**ARC:**

To generate answers for ARC, cd into the Winogrande directory and run:
	`python gen_arc.py --agents 1 --rounds 1 --model ft-gpt3.5`

To evaluate the generated results of ARC:
	`python eval_arc.py --agents 1 --rounds 1 --model ft-gpt3.5`
	
You can download the ARC dataset [here](https://allenai.org/data/arc)

**Winogrande:**

To generate answers for Winogrande, cd into the Winogrande directory and run:
	`python gen_winogrande.py --agents 1 --rounds 1 --model ft-gpt3.5`

To evaluate the generated results of ARC:
	`python eval_winogrande.py --agents 1 --rounds 1 --model ft-gpt3.5`
	
You can download the Winogrande dataset [here](https://winogrande.allenai.org/)

**AlpacaEval**

To generate answers for AlpacaEval, cd into alpaca-eval directory and run:
	`python alpaca_leaderboard.py --model ft-gpt3.5`
Then, clean the generate responses for good evaluation using:
	`cd ..`
	`python clean-data.py --file {FILENAME}`
Then use the alpaca-eval repo to run an evaluation on the responses [here](https://github.com/tatsu-lab/alpaca_eval)