# 🧠 Revisiting Multi-Agent Debate as Test-Time Scaling: When Does Multi-Agent Help?

This repository contains the implementation and evaluation code for our study on multi-agent debate as a test-time scaling method. The codebase supports two main domains:

- **Mathematical Reasoning**
- **Safety**

## Environment Setup

We recommend using conda for dependency management:

```bash
conda create -n mad python=3.11 -y
conda activate mad
pip install -r requirements.txt
```

> ⚠️ To use **OpenAI's API for evaluation**, you must **downgrade** the OpenAI Python package to version 0.28.0:

```bash
pip install openai==0.28.0
```

## Model Server Setup (Optional)

To run your own API server for model inference:

```bash
bash open_server.sh
```

Refer to `open_server.sh` for details on model loading and configuration.

## Inference Scripts

Inference scripts are provided for both domains:

### Mathematical Reasoning

- Single-agent inference:

```bash
bash scripts/generation/run_math_single.sh
```

- Multi-agent debate:

```bash
bash scripts/generation/run_math_multi.sh
```

Key parameters:

- Dataset selection (`gsm8k`, `math500`, `aime2024`, `aime2025`)
- Number of agents and voting mechanism
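To make the "voting mechanism" parameter concrete, here is a minimal sketch of plain majority voting over the agents' final answers. It is an illustration only: this README does not specify which voting rule the scripts implement, and the function name `majority_vote` and its tie-breaking behavior are assumptions.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent non-empty answer, or None if there is none.

    Ties are resolved in favor of the answer that appeared first,
    because Counter keeps equal counts in first-seen order.
    """
    cleaned = [a.strip() for a in answers if a and a.strip()]
    if not cleaned:
        return None
    return Counter(cleaned).most_common(1)[0][0]


if __name__ == "__main__":
    # Three hypothetical agents answer the same problem.
    print(majority_vote(["42", "41", "42"]))
```

With an even number of agents, ties become possible, so the tie-breaking rule (here: first seen) can affect the reported accuracy.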

### Safety

- Single-agent self-refinement:

```bash
bash scripts/generation/run_safety_single.sh
```

- Multi-agent debate:

```bash
bash scripts/generation/run_safety_multi.sh
```

Configurable options:

- Dataset selection (`safetybench`, `advbench`)
- Scaling method (`self-refinement`, `best-of-N`)
- Persona and judge model settings
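The two scaling methods differ in how they spend extra test-time compute: self-refinement revises one response sequentially, while best-of-N samples several responses and keeps the one a judge scores highest. The sketch below illustrates that difference under stated assumptions. The callables `generate`, `refine`, and `judge_score` are hypothetical stand-ins for model calls, not functions from this repository.

```python
from typing import Callable, Sequence


def self_refine(prompt: str,
                generate: Callable[[str], str],
                refine: Callable[[str, str], str],
                rounds: int = 2) -> str:
    """Sequential scaling: draft once, then revise the draft `rounds` times."""
    response = generate(prompt)
    for _ in range(rounds):
        response = refine(prompt, response)
    return response


def best_of_n(candidates: Sequence[str],
              judge_score: Callable[[str], float]) -> str:
    """Parallel scaling: return the candidate with the highest judge score."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    return max(candidates, key=judge_score)


if __name__ == "__main__":
    # Toy judge (assumption): rewards responses that contain a refusal phrase.
    def toy_judge(response: str) -> float:
        return 1.0 if "cannot help" in response.lower() else 0.0

    samples = ["Sure, here is how...", "I cannot help with that."]
    print(best_of_n(samples, toy_judge))
```

In a real run, the judge would be the configured judge model rather than a string match.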

## Evaluation Scripts

### Mathematical Reasoning Evaluation

Single-agent evaluation (replace `DATASET_NAME` with a dataset identifier such as `gsm8k`):

```bash
bash scripts/eval/eval_math_self.sh DATASET_NAME
```

Multi-agent evaluation:

```bash
bash scripts/eval/eval_math_multi.sh DATASET_NAME
```

Run all evaluations across all datasets:

```bash
bash scripts/eval/evaluate_all.sh
```

### Safety Evaluation

```bash
bash scripts/eval/eval_safety_api.sh  # API-based evaluation
```
