# The Limits of Debate as a Scalable Oversight Strategy 

We investigate the limits of debate as a scalable oversight strategy by extending Arnesen's paper on [Training Language Models to Win Debates with Self-Play Improves Judge Accuracy](https://arxiv.org/abs/2409.16636). Debate is a scalable oversight strategy where two models argue for opposing sides of a question: one of them is assigned to defend the correct answer, and the other is assigned to defend an incorrect answer. The debaters are able to see each other's arguments when conducting the debate. Based on the transcripts of the debate, a judge model decides which debater has presented the stronger argument for its side. Arnesen's paper finds that, in the context where superhuman debater models are simulated through privileged access to information, there is a "small but significant positive relationship between the ability of the model to win a debate and the usefulness of that model's debate transcripts in discovering true answers."

In this project, we look at reproducing Arnesen's results, and look at extending the results to a wider variety of models. In particular, we look at using the same SFT and DPO method described in Arnesen's paper to train GPT 4.1 as the judge model, and we explore reinforcement fine-tuning of o4-mini as a debater model. We are also currently exploring using the GradientAI Llama3-8B-Instruct model with an extended context length that they use as a judge. Finally, we have also run experiments with 4.1 nano.

We also explore different contexts for the debaters and judge. In Arnesen's paper, the [QuALITY dataset](https://arxiv.org/abs/2112.08608) of stories is used to generate questions in a hidden information task, where debaters have access to the stories while the judge does not. In this project, we have formulated a new dataset based on Lojban, a constructed logical language, using questions developed from the [FindTheFlaws dataset](https://arxiv.org/abs/2503.22989), while retaining the hidden information character of the task. In this task, the debaters are asked questions about translations of Lojban sentences, and are provided with background material from the Lojban language reference.

## Large Artifacts Omitted from the Code Archive

This repository archive is intended to contain the anonymized code, configuration files, schemas, and small authored inputs needed to inspect and rerun the pipeline. Large external datasets, generated JSONL files, caches, and SFT artifacts are not included in the code ZIP. These files can be regenerated from the included scripts and inputs, or obtained from the cited upstream sources or from the authors after review.

The omitted Lojban datasets are generated artifacts:

```bash
cd task_relevant
python3 dataset_generator.py
mkdir -p ../data/datasets/lojban
mv lojban_dataset.jsonl lojban_dataset_test.jsonl ../data/datasets/lojban/
```

If the Lojban definition files need to be rebuilt first, run `python3 word_search.py` from `task_relevant/`. The generation uses the included `task_relevant/data.csv`, `task_relevant/lojban_test_set.csv`, `task_relevant/downloaded_sections/`, and Lojban definition files.

The QuALITY and quality-debate datasets are external/source artifacts and should be placed at:

- `data/datasets/quality/QuALITY.v1.0.1.htmlstripped.train`
- `data/datasets/quality/QuALITY.v1.0.1.htmlstripped.dev`
- `data/datasets/quality/QuALITY.v1.0.1.htmlstripped.test`
- `data/datasets/quality-debates/debates-readable.jsonl`

The QuALITY files can be obtained from the QuALITY dataset release. The human/model debate transcript file can be obtained from the authors or the corresponding upstream debate data source after review.

The scratchpad debate pickle at `data/datasets/scratchpad-quality-debates/scratchpad-quality-debates.p` is a cache. It is regenerated automatically by `ScratchpadQualityDebatesLoader` from `data/datasets/quality-debates/debates-readable.jsonl` when absent.

The SFT artifacts under `sft_data/judge/` and `sft_data/debater/debater_combined.jsonl` are derived training files. To rebuild the debater SFT file after obtaining the source CSV and quality-debate transcript file:

```bash
python3 sft_data/khan_to_jsonl_for_debaters.py \
  --csv path/to/llm_debate_human_judge_dataset.csv \
  --original data/datasets/quality-debates/debates-readable.jsonl \
  --out sft_data/debater/debater_combined.jsonl
```

Judge SFT helper scripts are in `sft_data/`: `prepare_judge_sft_human_data.py`, `prepare_judge_sft_llm_data.py`, `prepare_data_for_sft.py`, and `unicode_fix.py`. The large source CSV archive `sft_data/llm_debate_human_judge_dataset.csv.zip` is omitted and can be obtained from the authors after review.

## MARS-Specific Instructions

We have a command-line interface that helps us with running the replication! This depends on setting up three environment variables so that you do not have to keep entering them.


- `PEM_FILEPATH`: create a PEM file at https://cloud.lambda.ai/ssh-keys and put it in ~/.ssh/
- `LAMBDA_LABS_API_KEY`: create one at https://cloud.lambda.ai/api-keys/cloud-api
- `OPENAI_API_KEY`: create one at "https://platform.openai.com/api-keys

Then, in your terminal or in your `.bashrc` / `.bash_profile`, set:
```
export LAMBDA_LABS_API_KEY=...
export PEM_FILEPATH=~/.ssh/lambda-labs.pem
export OPENAI_API_KEY
```

Once this is done, you can run the commands below. If this is not done, you will be prompted to provide each of these before the CLI can run.

### Connect to the Lambda Labs instance

If this fails, this may be because we have shut the instances down to save money. You can start an instance at https://cloud.lambda.ai/instances.

```bash
./cli.sh ssh
```

### Run a long-running background task

This command allows a background task to be run without needing your terminal to be open. The configurations are listed in `standard_experiment.yaml` – customize this with the command you want to run.

```bash
./cli.sh bg-task -- python scripts/run_debate.py --configuration="debater-untrained-judge-untrained"
```

### Viewing logs

```bash
./cli.sh logs -f task-20250709-112351
```

### Retrieve generated files from the instance

This syncs the file state on the machine to your local machine. For maximum safety, this does not overwrite any existing files on your machine.

```bash
./cli.sh rsync-to-host
```

### Monitor all instances

This allows s

```bash
./cli.sh monitor -- nvidia-smi
```

## Run Orchestration

We have a tool that allows debate experiments in `standard_experiments.yml` to be run declaratively across our fleet of instances. To do this, create a new configuration file in `run_orchestrator/runs`. The schema for such a configuration file is given by `run_orchestrator/experiment_orchestrator.schema.json`. The output data is saved into a namespaced folder in `outputs` based on the `name` in the configuration (not the filename!). For example, a configuration named `debater-trained` will save its outputs into `outputs/debater-trained`. This allows parallel processes across different instances to save their outputs into the same directory across machines, which makes it possible for us to merge them using the `merge-data` subcommand. 

To start the command-line tool, run `python run_orchestrator/run_experiments.py`, which allows running the configuration files with `start`, retrieving the results with `download`, merge data collected on multiple instances with `merge-data`, and plot an unsatisfying graph using `graph`.

# Arnesen's Original Readme

## Setup

### Basic Setup
1. Pull down this package using `git clone`
2. Install the dependencies using `pip install -r requirements.txt`
3. Create an `.env` file with the following variables:
	`SRC_ROOT`: Contains the file path of the root of this code base
	`INPUT_ROOT`: Contains the file path to the directory where the dataset files are located
	`OPENAI_API_KEY`: OpenAI key (Optional, if one is using OpenAI for judging)
	`OPENAI_ORGANIZATION`: OpenAI organization (Optional, if one is using OpenAI for judging)
	`META_ACCESS_KEY`: Meta huggingface token, (Optional, if one is downloading a Llama model via HF)
	`ANTHROPIC_API_KEY`: Anthropic access key (Optional, if one is using Anthropic for judging)

### HPC Setup
1. Follow the HPC's guide to setup a Singularity container. Remember to install the dependencies in the virtual environment in that container.
2. Run `bash bash_scripts/hpc_setup.sh` from inside the Singularity container

## Tests

To run the tests, run `bash bash_scripts/basic_tests.sh`. To run an individual test, run `bash bash_scripts/operational_test.sh [TEST_NAME]`. To test the training code, run `bash bash_scripts/train_tests.sh`

### Scripts
The primary entrance points are in the `scripts` directory. Here is what each of the scripts are intended to do:
* **run_debate.py**: This kicks off a series of debate rounds. The configuration for those rounds (number of debaters, models used, number of speeches, use of scratchpad, decoding strategy, batching, etc.) is set in `experiments/configs`. Example Usage: `python3 ./scripts/run_debate.py --num_iters=50 --log_level='DEBUG' --configuration='Test'`
* **run_iterative_dpo.py**: Kicks off a DPO training run. The configuration for the training run is set in `train/configs/dpo_config.yaml`. Example usage: `python3 ./scripts/run_iterative_dpo.py --configuration='Test'`
* **run_sft.py**: Kicks off a SFT training run. The configuration for the training run is set in `train/configs/sft_config.yaml`. Example usage: `python3 ./scripts/run_sft.py --configuration='Test'`

## Code Structure

### High-Level Summary

One can use this package to train models or to run experiments (aka validation or inference). The key modules are as follows:

1. **Data:** This module provides a unified interface for loading, parsing, and interacting with the debate datasets. 
2. **Prompts:** This modules contains a configuration file and parser where you define the prompts you expose to the debaters and judge. It interacts with the `data` module to fill out the prompts with actual values.
3. **Models:** This module contains a unified interface for the resources that can generate text. Each model can wrap around an open LLM (e.g. Llama 2), an API (e.g. OpenAI's GPT4), or any other mechanism for generating text (e.g. it is convenient, for testing purposes, to have a model that generates random text). 
4. **Debate:** This module defines the core debate-specific abstractions. That includes the abstraction for an agent (i.e. a `Debater` or `Judge`), a `Transcript`, a speech order, and a `DebateRound`. The agents maintain an internal transcript and wrap around a `Model` in order to generate text.
5. **Experiment:** This contains the logic for running experiments. A loader read a config file and creates a series of `DebateRound` objects, whose results are then tallied, logged, and displayed.
6. **Train:** This modules contains all the logic for training a model. It interacts with the `prompt`, `data`, `models` and `debate` modules to consistency across input and output formats.
7. **Scripts:** This is the entrance point for running experiments and training models.

### Data
This module handles the loading of different datasets. The data loaded in these datasets is used for the questions that are debated in the experiments as well as the training signal for the training scripts. The datasets are as follows:
* **Quality Debates**: These are the debates that were run as part of the NYU Human Debate experiments.
* **Quality**: These are the full set of hard questions from the QuALITY dataset.

### Debate
This module controls the primary abstractions for actually running experiments. The main abstractions are as follows:
* **Debate Round**: This is a structure containing two debaters and a judge. Each debater or judge generates speeches and collect their own transcripts. The judge is the main coordinator for the round, so the primary role of the `DebateRound` object is to pass information (speeches) from the different debaters and to report the winner at the end of the round.
* **Judge**: This agent asks questions, renders verdicts, and decides who the next speaker is. It uses a `Model` to generate text. It receives a `SpeechFormat` argument that determines the order of speeches that it expects. There is a child `BranchedJudge` class for the use in branched rounds.
* **Debater**: This wraps around a `Model` class, which it uses to generate speeches when called. There is a child `BestOfNDebater` class for Best-of-N rounds (when it has to generate multiple speeches). It receives a `SpeechFormat` argument that determines the order of speeches that it expects.
* **Transcript**: Each participant (judge and debater) maintains a `Transcript` object that tracks the round so far. It can convert into either a `ModelInput` format that can be used by a `Model` or a string that can be written to the logs.
* **Speech Format**: This defines the order of speeches. It has default options for both debate and consultancy. These get passed to the debaters and judge so that they know whose turn it is to speak.

### Experiments
This module controls the configuration for running experiments (aka debate rounds) as well as the metrics collection for those experiments. See `experiments/configs/example_experiment.yaml` for a detailed explanation of each potential configuration option.
* **experiment_loader.py**: This reads from the configuration set in `experiments/configs` and constructs a set of `DebateRound` objects.
* **quotes_collector.py**: This handles the metrics collections for the metrics about quotation usage.
* **results_collector.py**: This handles the overall metric collection.

### Models
This module also contains the `Model` abstraction that is used to generate text. The models are as follows:
* **Deterministic Model**: This model just outputs the same text over and over again. This is useful when testing.
* **LLM Model**: This model generates text by invoking a model from Huggingface (yes I know the "M" in LLM stands for "model" but "ll_model" looked strange). It has child classes for different flavors of Huggingface models that have different input formats. At the moment, those two implementing classes are for Llama and Mistral models. This also contains logic for implementing a Linear Probe judge (aka a judge that adds a linear layer on top of the activations of another model), however this is less tested.
* **Offline Model**: This model just repeats the text from a text transcript that it is provided. This is useful if one wants to re-evaluate the debate with a different judge than the one originally used.
* **OpenAI Model**: This model generates text by calling OpenAI.
* **Anthropic Model**: This model generates text by calling Anthropic's Claude model.
* **Random Model**: This model generates random strings of text. It is useful when testing.
* **Served Model**: This model makes requests to a localhost destination where it expects a model to be hosted (this can speed up inference dramatically if the model is hosted using an inference engine).
* **Human Model**: This model outputs real text from the corresponding human debate transcripts. It only works if one is sampling questions from the real human debates. (Not recommended)

### Prompts
This module controls the prompts that are used while conducting experiments or training. There is a parser for both normal prompts and 'dynamic prompts' (which are just prompts that change depending on certain attributes). The actual prompt language can be found in `prompts/configs`

### Train
This module controls the logic for finetuning the models using DPO, PPO, or supervised finetuning. The specific classes are:
* **DPO Trainer**: This is a class for training models using DPO
* **PPO Trainer**: This is a class for training models using DPO
* **SFT Trainer**: This is a class for training a model to debate via supervised finetuning.
* **Row Converter**: This is a utility for converting a transcript into the format expected by the trainers. To do so, it interacts with the `debate`, `prompts`, and `data` abstractions.
* **Train Utils**: Additional utilities for loading models and datasets, specifically for training.

## Potential Uses

### Running a Tournament (Data Generation / Validation)
1. Create a new config entry under `experiments/configs`. If you're running this locally, add the entry under `test_experiment.yaml` and if you're running it remotely, add it under `standard_experiment.yaml`. 
2. To kick off the new tournament, run `python python3 ./scripts/run_debate.py --num_iters=Number-of-Iterations --configuration='Your-New-Configuration-Name`. If you're running it locally, you'll need to a `--test` flag so it knows to look in the right place.

**Examples:**
All of the following examples can be found under `experiments/configs/standard_experiment.yaml` unless otherwise specified.
* **Debate - Data Generation**: See the entry for `Data Generation - Llama3 - MultiRound - HalfBranched - FullTrain - DPO`. This demonstrates a single model with branched rounds.
* **Consultancy - Data Generation**: See the entry for `Data Generation - Llama3 - MultiRound - HalfBranched - FullTrain - Consultancy`
* **Debate - Self Play Validation**: See the entry for `val - experiment debate - 16`.
* **Consultancy - Self Play Validation**: See the entry for `val - experiment consultant - 16`
* **Debate - Cross Play Tournament**: See the entry for `val - cross-play - 0 - 176 - 464`. This config covers a small slice of a broader round-robin tournament. (For ease in parallelizing the cross-play tournament across multiple machines, we split up the round robin into small chunks -- note that, while only three models are used in this particular run, all of the other models are still defined so that the proper rounds are assigned to each model)
* **Debate - Offline Judging**: See `Offline_Debate_Test` under `test_experiment.yaml`.
* **Consultancy - Offline Judging**: See `MultiTurn_Consultancy_Offline_Test` under `test_experiment.yaml`. Note that `alternate` is set to True (this is needed only for offline consultancies).
* **Converting Single Consultancy to Double Consultancy**: See the entry for `Consultancy_Double_Test` under `test_experiment.yaml`. Note that the speech structure is `default_debate` and that `convert_to_double_consultancy` is True.

### Training a New Model
1. Depending on how you want to train your model (SFT, DPO, PPO, etc.), add a new entry in the associated config that can be found under `train/configs/[dpo/ppo/sft]_config.yaml`.
2. Kick off your training run by running the associated script in the `scripts` directory (e.g. `python scripts/run_sft.py --config="YourNewConfigName"`).

**Examples:**
1. **DPO Training** See `Iterative - FullTrainDebateLowLR - 77` under `train/configs/dpo_config.yaml`. Note how multiple existing datasets are concatenated together to train a single model.
2. **SFT Training** See `Train - Llama3 - Human and GPT` under `train/configs/sft_config.yaml`

### Adding a New Dataset
You might want to create a new dataset if you're using a data source other than QuALITY. Here are the steps to add a new dataset:
1. Create a new directory under `data/datasets/` and add your file(s) there.
2. Under the `data` directory, create a python file called `[your_dataset_name]_loader.py`. In that file, you will define two classes, a `[YourDatasetName]Loader` and `[YourDatasetName]Dataset`. The loader should just parse your dataset and pass it in to the dataset constructor. The dataset itself will split out the data into the rows that you want, following the interface defined in `data/dataset.py`. The file `data/quality_loader.py` is also a good example that one can follow (although it has some extra deduplication logic you may not need).
3. Under `data/dataset.py`, add a new enum corresponding to your new dataset type.
4. Under `data/loader_utils.py`, add a new conditional to support creating your new dataset.
5. Under `data/__init__.py`, import your new dataset.

Now you should be good to reference your new dataset from a train or experiment config file.

### Adding New Prompts
1. Go to `prompts/configs/prompts.yaml` and add a new entry corresponding to your new set of prompts. Feel free to follow the examples in that file -- the two sets of existing prompts are called `Debate Prompt` and `Consultancy Prompt`.
2. If your new prompt uses any unique names that do not already exist in the existing prompts (the 'names' are the titles of each individual sub-entry such as `overall_system` or `pre_opening_speech`), then go to `prompts/parser.py` and add that title as a new enum under `PromptTag`.
3. If you require filling in a new kind of value (currently, we support filling out prompts with a debater name, opponent name, position, opponent position, topic/question, and the background text), then add that new kind of value under `PromptConfig` in `prompts/parser.py`. This new value will be inserted into the prompts as long as the exact key is referenced inside angular brackets (e.g. if you reference `<NAME>` in your prompts, then `<NAME>` will be replaced with the value associated with `name` in the PromptConfig.) 

### Creating New Speech Orders
You might want to create new speech order if you do not want the speeches to be delivered in the same format as we have previously set.
1. Under `debate/speech_format.py`, create a new method in the `SpeechFormat` object, following the example of `default_debate_format()` and `default_judge_format()`. 
2. Also under `debate/speech_format.py`, create a new enum under `SpeechFormatStructure.`
3. Also under `debate/speech_format.py`, create a new enum under `SpeechFormatType.`

Now you can reference your new speech orders in your configs by using the name you gave under `SpeechFormatType`.

### Using a new open-weight LLM
We currently support generating speeches using Mistral or Llama. If you want to add a different type from Huggingface, you should do the following:
1. Under `models/llm_model.py`, create a new class that inherits from `LLModel`. At a minimum, it just needs to define the instruction prefixes and suffixes. See the `LlamaModel` class as a good example. It also should implement a `copy()` method.
2. Also under `models/llm_model.py`, create a new enum under `LLMType`.

Now you should be free to reference your new model type in your configs by using the name you defined as the `LLMType` enum.
