<h1 align="center">Multi-LLM-Debate-Judge</h1>

Thank you for reviewing our paper!

This repository contains the implementation for the paper "Multi-Agent Debate for LLM Judges with Adaptive Stability Detection".

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Installation](#installation)
- [Project Structure](#project-structure)
- [Usage](#usage)
  - [Configuration](#configuration)
  - [Running Evaluations](#running-evaluations)
- [Analysis](#analysis)
- [Notebooks](#notebooks)
- [Linting and Formatting](#linting-and-formatting)

## Installation

This project uses Poetry for dependency management.

1.  Install Poetry (if you haven't already):
    ```bash
    curl -sSL https://install.python-poetry.org | python3 -
    ```
2.  Change into the repository directory:
    ```bash
    cd Debate-LLM-Judge
    ```
3.  Install dependencies:
    ```bash
    poetry install
    ```
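4.  Activate the virtual environment:
    ```bash
    poetry shell
    ```
    On Poetry 2.0 and later, `poetry shell` is provided by a separate plugin (`poetry-plugin-shell`) and is not available by default. Alternatively, prefix individual commands with `poetry run`.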

## Project Structure

```
Debate-LLM-Judge/
├── LICENSE
├── poetry.lock
├── pyproject.toml
├── pytest.ini
├── README.md
├── setup.cfg
├── multi_llm_debate/
│   ├── __init__.py
│   ├── analysis/         # Scripts for analyzing debate results
│   ├── debate/           # Core debate logic (agents, rounds)
│   ├── distribution_model/ # Models for distribution fitting
│   ├── interventions/    # Pruning and other intervention strategies
│   ├── llm/              # LLM interaction and prompt building
│   ├── run/              # Scripts to run evaluations for different datasets/tasks
│   │   ├── big_bench/
│   │   ├── hallu_dial/
│   │   ├── judge_anything_pair/
│   │   ├── judge_bench/
│   │   ├── llm_bar/
│   │   ├── mllm_judge_pair/
│   │   ├── shared/       # Shared utilities for running evaluations
│   │   └── truthful_qa/
│   ├── scripts/          # Shell scripts for running experiments (often wrapping python scripts)
│   └── utils/            # Utility functions (config, logging, etc.)
└── notebooks/            # Jupyter notebooks for visualization and exploratory analysis
    ├── convergence.ipynb
    └── correct_rate.ipynb
```

## Usage

### Configuration

1.  Create a configuration file by copying the example into a `configs/` directory:
    ```shell
    mkdir -p configs
    cp config.json.example configs/config.json
    ```
    If your checkout provides the template as `config.json` rather than `config.json.example`, copy that file instead.

2.  Modify the `configs/config.json` file with your API keys, base URLs, and model configurations.

    Example:
    ```json
    {
        "api_key": "your_openai_api_key",
        "base_url": "your_api_url_if_not_openai_default",
        "models": [
            {"provider": "ollama", "name": "llama3.1:latest", "quantity": 3},
            {"provider": "api", "name": "gpt-4o", "quantity": 1}
        ]
    }
    ```
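To make the role of each field concrete, here is a minimal sketch of loading and sanity-checking such a file. It is not the repository's own loader: the function names `load_config` and `expand_agents` are hypothetical, and the sketch assumes that `quantity` is the number of debate agents instantiated from that model entry.

```python
import json
from pathlib import Path

# Keys that every entry in "models" carries in the example config.
REQUIRED_MODEL_KEYS = {"provider", "name", "quantity"}


def load_config(path):
    """Read the JSON config and check the fields shown in the example."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    models = config.get("models")
    if not isinstance(models, list) or not models:
        raise ValueError("'models' must be a non-empty list")
    for index, model in enumerate(models):
        missing = REQUIRED_MODEL_KEYS - model.keys()
        if missing:
            raise ValueError(f"models[{index}] is missing keys: {sorted(missing)}")
        if not isinstance(model["quantity"], int) or model["quantity"] < 1:
            raise ValueError(f"models[{index}].quantity must be a positive integer")
    return config


def expand_agents(config):
    """Assumption: 'quantity' is the number of agents created per model entry."""
    return [
        (model["provider"], model["name"])
        for model in config["models"]
        for _ in range(model["quantity"])
    ]
```

Under that assumption, the example config would expand to three `("ollama", "llama3.1:latest")` agents followed by one `("api", "gpt-4o")` agent.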

### Running Evaluations

Evaluation scripts for various datasets are located in the `multi_llm_debate/run/` directory. Each sub-directory typically contains a `main.py` script to execute the evaluation for that specific task.

For example, to run the TruthfulQA evaluation:
```bash
python -m multi_llm_debate.run.truthful_qa.main --config-json path/to/your/configs/config.json [other_arguments]
```

Replace `truthful_qa` with the desired task, such as `llm_bar` or `judge_bench`.

Common arguments include:
*   `--config-json`: Path to your JSON configuration file for models.
*   `--task-name`: A name for the specific run, used for organizing output data.
*   `--sample-size`: Number of samples to process from the dataset.
*   `--batch`: Enable batch processing.
*   `--batch-size`: Number of entries to process in a single batch.
*   `--overwrite`: Overwrite existing results for an entry.
*   Pruning arguments like `--diversity-pruning`, `--diversity-pruning-amount`, etc.

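Refer to the specific `main.py` script or run `python -m multi_llm_debate.run.<task>.main --help` for the detailed options of each task. Here `<task>` is the sub-directory name under `multi_llm_debate/run/` (for example `truthful_qa`), not the value passed to `--task-name`.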
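Because every task follows the same `python -m multi_llm_debate.run.<task>.main` pattern, several evaluations can be driven from one script. The sketch below only assembles and launches that command line. The helper names `build_command` and `run_tasks` are hypothetical, and it relies on nothing beyond the `--config-json` and `--sample-size` flags listed above.

```python
import subprocess
import sys


def build_command(task, config_path, sample_size=None):
    """Assemble the documented module invocation for one task."""
    command = [
        sys.executable, "-m", f"multi_llm_debate.run.{task}.main",
        "--config-json", str(config_path),
    ]
    if sample_size is not None:
        command += ["--sample-size", str(sample_size)]
    return command


def run_tasks(tasks, config_path, sample_size=None, dry_run=True):
    """Run each task in turn; with dry_run=True only print the commands."""
    for task in tasks:
        command = build_command(task, config_path, sample_size)
        print(" ".join(command))
        if not dry_run:
            # check=True stops the loop at the first failing evaluation.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run: prints the two commands without starting any evaluation.
    run_tasks(["truthful_qa", "llm_bar"], "configs/config.json", sample_size=10)
```

Keeping `dry_run=True` as the default makes it easy to inspect the exact commands before spending API calls on a full run.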

Many tasks also have corresponding shell scripts in the `multi_llm_debate/scripts/` directory which can be used to run evaluations, often setting up specific model configurations or server environments.

Example using a script (run it from the repository root and check that the paths inside the script match your environment):
```bash
sh multi_llm_debate/scripts/JudgeBench/11_gemma-3-4b-it.sh
```

## Analysis

The `multi_llm_debate/analysis/` directory contains scripts for analyzing the results of the debates:
*   `analyze_convergence.py`: Analyzes the convergence of debates.
*   `average_tokens.py`: Calculates average token usage.
*   `calculate_correct_rate_by_round.py`: Calculates correct rates per debate round.
*   `calculate_correct_rate_distribution.py`: Analyzes the distribution of correct rates.
*   `calculate_task_accuracy.py`: Calculates overall accuracy for tasks.
*   `classify_task_difficulty.py`: Classifies task difficulty based on results.
*   `plot_accuracy.py`: Generates plots for accuracy.
*   `plot_correct_rate_distribution.py`: Plots the distribution of correct rates.

These scripts typically take paths to the output data generated by the evaluation runs.

Example:
```bash
python -m multi_llm_debate.analysis.calculate_task_accuracy --data_dir path/to/evaluation_results
```
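As an illustration of the kind of quantity these scripts report, the sketch below computes a correct rate per debate round. The record layout (a list of `(round, is_correct)` pairs) and the function name `correct_rate_by_round` are assumptions made for this example; the actual output format of the evaluation runs is defined by the repository code.

```python
from collections import defaultdict


def correct_rate_by_round(records):
    """Return {round: fraction of correct answers} for (round, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for round_index, is_correct in records:
        totals[round_index] += 1
        correct[round_index] += int(is_correct)
    return {r: correct[r] / totals[r] for r in sorted(totals)}


# Hypothetical data: two agents answering over two rounds.
example = [(0, True), (0, False), (1, True), (1, True)]
print(correct_rate_by_round(example))  # {0: 0.5, 1: 1.0}
```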

## Notebooks

The `notebooks/` directory contains Jupyter notebooks for more interactive analysis and visualization:
*   `convergence.ipynb`: For analyzing and visualizing debate convergence.
*   `correct_rate.ipynb`: For analyzing and visualizing correct rates.

To run them, make sure Jupyter is installed (`pip install jupyter`), then start the notebook server:
```bash
jupyter notebook
```
Navigate to the `notebooks` directory and open the desired `.ipynb` file.

## Linting and Formatting

This project uses `flake8` for linting, and `black` and `isort` for code formatting.
Configuration for these tools can be found in `pyproject.toml` and `setup.cfg`.

To run linters and formatters:
```bash
poetry run flake8 .
poetry run black .
poetry run isort .
```
It's recommended to set up pre-commit hooks to automate this:
```bash
poetry run pre-commit install
```