# Benchmarking Overton Pluralism in LLMs

[![arXiv](https://img.shields.io/badge/arXiv-2512.01351-b31b1b.svg)](https://arxiv.org/abs/2512.01351)
[![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-yellow.svg)](https://huggingface.co/datasets/elinorpd/overtonbench)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

This repository contains code for "[Benchmarking Overton Pluralism in LLMs](https://arxiv.org/abs/2512.01351)" (ICLR 2026) by [Elinor Poole-Dayan](https://elinorp-d.github.io/), [Jiayi Wu](https://jiayiw005.github.io/), [Taylor Sorensen](https://tsor13.github.io/), [Jiaxin Pei](https://jiaxin-pei.github.io/), and [Michiel A. Bakker](https://miba.dev/). The data for this work can be found on Hugging Face [here](https://huggingface.co/datasets/elinorpd/overtonbench).

This project proposes a formal metric for measuring Overton pluralistic alignment, a concept introduced by [Sorensen et al., 2024](https://arxiv.org/pdf/2402.05070) in which a model gives comprehensive, high-coverage responses that represent the spectrum of reasonable viewpoints, rather than aligning to a single viewpoint or a limited set of perspectives.

<img src="assets/fig1.png" />

The instructions below are for **reproducing the paper's results** using the OvertonBench dataset from Hugging Face. For running the pipeline on **your own dataset**, see [README_extended.md](./README_extended.md).

## Setup

1. Create an environment and install dependencies:
   - **Reproducible (recommended):** `pip install -r requirements-lock.txt` installs the exact versions used in the paper.
   - **Otherwise:** `pip install -r requirements.txt` installs the latest compatible versions.
2. Create a `.env` file in the project root with API keys for whichever LLMs you use.

Example `.env`:
```
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=AI...
# DATASET=path/to/your.csv   # optional; uncomment to use your own data instead of HF
```
Run all commands below from the repository root.
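
For illustration, the sketch below shows one way a script could read keys from a `.env` file using only the standard library. It is not taken from this repository, which may use a different loader (for example `python-dotenv`); `load_env_file` and `export_env` are hypothetical helper names.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env-style file into a dict.

    Hypothetical helper for illustration; not the repository's loader.
    Blank lines and lines starting with '#' are skipped, and trailing
    inline comments (a space followed by '#') are dropped.
    """
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop an inline comment such as "value   # note", then any quotes.
        value = value.split(" #", 1)[0].strip().strip('"').strip("'")
        values[key.strip()] = value
    return values


def export_env(values):
    """Copy parsed values into os.environ without overwriting existing ones."""
    for key, value in values.items():
        os.environ.setdefault(key, value)
```

Calling `export_env(load_env_file())` from the repository root would make the keys visible to any client library that reads them from the environment.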

**Data:** The pipeline loads the OvertonBench dataset from [Hugging Face](https://huggingface.co/datasets/elinorpd/overtonbench) by default. To use the ModelSlant or PRISM split instead of the full dataset, pass `--source modelslant` or `--source prism` to scripts as needed.

## Code Overview

```
.
├── README.md
├── assets                   # overview figure (fig1.png)
├── outputs                  # main paper results (reference) + generated outputs
│   ├── overton_scores_and_ols_tau4.0.csv   # human benchmark scores & OLS (τ=4.0)
│   ├── overton_scores_and_ols_tau4.0.md
│   └── predictions          # LLM prediction outputs and baselines
├── requirements.txt        # direct dependencies
├── requirements-lock.txt   # pinned versions for reproducible installs
└── src
    ├── benchmark_overton_pipeline.py   # human Overton scores and OLS
    ├── load_dataset.py                # load OvertonBench from Hugging Face
    └── prompting_pipeline            # LLM predictions, baselines, eval notebooks
```

# Main Paper Reproduction

The main pipeline loads the OvertonBench dataset from Hugging Face by default. For the full data-processing pipeline (e.g. running on your own raw data), see [README_extended.md](./README_extended.md).

## 1. Human Benchmark Results
The main analysis script is [`benchmark_overton_pipeline.py`](./src/benchmark_overton_pipeline.py), which produces Overton benchmark scores and regression diagnostics for each model.

```bash
python src/benchmark_overton_pipeline.py --weighted
```

### Outputs:

All files are saved to the output directory specified by `--outdir` (default `outputs/`). Filenames include the threshold τ (default 4.0); use `--tau` to change it.

- `overton_scores_and_ols_tau{tau}.csv` (e.g. [`overton_scores_and_ols_tau4.0.csv`](./outputs/overton_scores_and_ols_tau4.0.csv)) — Quantitative Overton scores and OLS regression results per model.
- `overton_scores_and_ols_tau{tau}.md` (e.g. [`overton_scores_and_ols_tau4.0.md`](./outputs/overton_scores_and_ols_tau4.0.md)) — Markdown summary of the main table for easy viewing.
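
As a small illustration of the naming scheme above, the sketch below builds the expected output paths for a given τ. `expected_output_paths` is a hypothetical helper, and the assumption that τ appears in the filename as a plain Python float (so `4.0` gives `tau4.0`) is inferred from the example filenames rather than from the script itself.

```python
from pathlib import Path


def expected_output_paths(tau=4.0, outdir="outputs"):
    """Return the CSV and Markdown paths for a given threshold.

    Hypothetical helper: assumes tau is written as str(float(tau)).
    """
    stem = f"overton_scores_and_ols_tau{float(tau)}"
    base = Path(outdir)
    return base / f"{stem}.csv", base / f"{stem}.md"


csv_path, md_path = expected_output_paths()
print(csv_path.as_posix())  # outputs/overton_scores_and_ols_tau4.0.csv
```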

### Running Individual Splits

To run the human benchmark for a specific split (e.g., "modelslant" or "prism") instead of the full OvertonBench dataset, provide the `--source` argument to the script:

```bash
python src/benchmark_overton_pipeline.py --weighted --source modelslant
# or
python src/benchmark_overton_pipeline.py --weighted --source prism
```

This generates the Overton scores and regression results for the specified split only, with output filenames that reflect the chosen source.


## 2. LLM Predictions with Best-Performing Judge
The best-performing judge was Gemini 2.5 Pro using a few-shot prompt containing example user ratings of other LLM responses to the same question as well as a user's written free response (FS+FR).

To run these predictions:
```bash
# --client gemini   runs Gemini 2.5 Pro
# --prompt fs+fr    selects the few-shot + free-response prompt
# --max_workers 8   uses up to 8 parallel workers
python src/prompting_pipeline/prediction.py \
  --client gemini \
  --prompt fs+fr \
  --max_workers 8
```

Optional: `--source modelslant` or `--source prism` to use that split (output filenames get a `_modelslant`/`_prism` suffix).

All prediction results are saved under `outputs/predictions/`.

### Baselines

LLM performance is compared against two baselines. To reproduce our baseline results, run [`semantic_baseline.py`](src/prompting_pipeline/semantic_baseline.py):
- Semantic similarity baseline: selects the most semantically similar of the seven other responses to the same question and assigns that response's rating. Results are stored as `sem_sim_avg` and `sem_sim_diff` in `outputs/predictions/baselines_rounded.csv`.
- Mean-of-others baseline: uses the average of the user’s ratings for the other seven responses, rounded to the nearest integer to match the 1–5 Likert scale. Results are stored as `mean_of_others_avg` and `mean_of_others_diff` in `outputs/predictions/baselines_rounded.csv`.

Optional: `--source modelslant` or `--source prism` for that HF split; `--n_rows N` to run on a random sample of N rows (e.g. for quick testing). Outputs go to `outputs/predictions/baselines_rounded.csv` (or `baselines_rounded_modelslant.csv`, `baselines_rounded_prism.csv` when using those splits).
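
The two baselines can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `semantic_baseline.py`: the function names are hypothetical, embeddings are assumed to be plain lists of floats from some unspecified embedding model, cosine similarity is assumed as the similarity measure, and halves are assumed to round up.

```python
import math


def round_half_up_to_likert(value, low=1, high=5):
    """Round to the nearest integer (halves round up) and clamp to [low, high].

    Assumption: the tie-breaking rule for .5 values is not stated in the text.
    """
    return max(low, min(high, math.floor(value + 0.5)))


def mean_of_others(other_ratings):
    """Predict a rating as the rounded mean of the user's other ratings."""
    return round_half_up_to_likert(sum(other_ratings) / len(other_ratings))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_similarity_baseline(target_vec, other_vecs, other_ratings):
    """Predict the rating of the other response most similar to the target."""
    sims = [cosine(target_vec, vec) for vec in other_vecs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return other_ratings[best]


# Toy example: one user's ratings of the seven other responses.
ratings = [4, 5, 3, 4, 2, 4, 5]
print(mean_of_others(ratings))  # 27 / 7 ≈ 3.86 -> 4
```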

## 3. LLM Benchmark Eval & Analysis

To assess the alignment between our LLM judge's predictions and human participant scoring, we evaluate performance using a range of metrics, including basic accuracy/error checks, generalization tests, and subgroup parity analyses.

We provide prediction and baseline outputs for the ModelSlant subset, which can be used as-is to reproduce the paper results. If you wish to run evaluations on the full dataset or your own splits, first run steps 1 and 2 so the complete prediction and baseline outputs are available. 

### A. Primary Metrics
We evaluate judges primarily by mean absolute error (MAE), mean squared error (MSE), and Spearman rank correlation. We also calculate a win rate: the percentage of datapoints on which one method has lower error than another (ties reported separately).

All analysis and plots of primary metrics can be reproduced by [`primary_eval.ipynb`](./src/prompting_pipeline/primary_eval.ipynb). The notebook’s default config points to the ModelSlant prediction and baseline files; run all cells in order (no need to change paths for paper reproduction). 
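
For readers who want the metric definitions in code, here is a minimal standard-library sketch. It is not the notebook's implementation: the function names are hypothetical, and the win rate is assumed to compare absolute errors per datapoint. `spearman` is undefined (division by zero) when either input is constant.

```python
def mae(preds, targets):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


def mse(preds, targets):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def win_rate(preds_a, preds_b, targets):
    """Return (win_pct, tie_pct) for method A against method B.

    A "win" is a datapoint where A's absolute error is strictly lower.
    """
    wins = ties = 0
    for a, b, t in zip(preds_a, preds_b, targets):
        err_a, err_b = abs(a - t), abs(b - t)
        wins += err_a < err_b
        ties += err_a == err_b
    n = len(targets)
    return 100 * wins / n, 100 * ties / n
```

For example, with human ratings `[1, 2, 3, 4, 5]`, judge predictions `[1, 2, 3, 5, 5]`, and a constant baseline of `3`, the judge wins on 60% of datapoints and ties on 40%.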

### B. Generalization
To test whether our benchmark generalizes to unseen models, we ran a leave-one-model-out (LOMO) analysis: for each target LLM, we replaced its human ratings with the best-performing judge's predictions and re-ran the OvertonScore OLS regressions.

This repository includes prediction results for Gemini with fr+fs, fr, and fs on the ModelSlant split in [`outputs/predictions/`](./outputs/predictions/). To reproduce the paper results, run with `--source modelslant`.

- **Default (fr+fs):**
```bash
python src/prompting_pipeline/lomo_generalization_metrics.py --source modelslant
```
- **Free-response only (fr):**
```bash
python src/prompting_pipeline/lomo_generalization_metrics.py --source modelslant --preds_csv outputs/predictions/gemini_all_rows_fr_modelslant.csv --pred_col gemini_fr_avg
```
- **Few-shot only (fs):**
```bash
python src/prompting_pipeline/lomo_generalization_metrics.py --source modelslant --preds_csv outputs/predictions/gemini_all_rows_fs_modelslant.csv --pred_col gemini_fs_avg
```

What it computes:
- Human baseline: an OLS linear probability model (LPM) with question fixed effects and cluster-robust standard errors; adjusted coverage is the average prediction, taken first within each question and then across questions (equal weight per question).
- LOMO folds: for each target model, substitute that model’s ratings with predictions, recompute adjusted coverage, refit OLS, and compare to the human baseline.
- Generalization metrics per fold: rank correlations, coefficient correlations and MAE, direction agreement, and replication of the target model's significance.
- A delta table: one row per model, showing human `adj_coverage` vs. that model's LOMO `adj_coverage` and their difference.
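
The averaging order and the substitution step can be illustrated with a simplified sketch. It deliberately skips the OLS refit and uses raw coverage indicators in place of fitted values, so it is not what `lomo_generalization_metrics.py` computes. The row layout (`question`, `model`, `rating`, `prediction` keys), the function names, and the rule that a response "covers" a participant when the rating is at least τ are all assumptions made for illustration.

```python
from collections import defaultdict


def adjusted_coverage(rows, model, rating_key="rating", tau=4.0):
    """Mean of per-question coverage for one model (equal weight per question).

    Assumption: a response "covers" a participant when rating >= tau.
    """
    by_question = defaultdict(list)
    for row in rows:
        if row["model"] == model:
            by_question[row["question"]].append(row[rating_key] >= tau)
    # Average within each question first, then across questions.
    per_question = [sum(v) / len(v) for v in by_question.values()]
    return sum(per_question) / len(per_question)


def lomo_delta(rows, target_model, tau=4.0):
    """Compare human coverage with coverage after swapping in predictions."""
    human = adjusted_coverage(rows, target_model, "rating", tau)
    lomo = adjusted_coverage(rows, target_model, "prediction", tau)
    return {"model": target_model, "human": human, "lomo": lomo,
            "delta": lomo - human}


# Toy data: two questions, model "A" is the held-out target.
rows = [
    {"question": "q1", "model": "A", "rating": 5, "prediction": 4},
    {"question": "q1", "model": "A", "rating": 3, "prediction": 4},
    {"question": "q2", "model": "A", "rating": 4, "prediction": 3},
    {"question": "q1", "model": "B", "rating": 2, "prediction": 2},
]
print(lomo_delta(rows, "A"))
```

Averaging within question first means a question with many raters does not outweigh one with few, which matches the "equal weight per question" note above.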

### C. Subgroup Parity
To test whether the LLM judge is more accurate for some groups than for others, we check for subgroup disparities using nonparametric permutation ANOVA tests (5000 permutations) for each category (sex, ethnicity, political party, selection position, and model) and each metric (MAE, MSE).

To reproduce the paper results, run all cells in order in [`src/prompting_pipeline/parity_analysis.ipynb`](src/prompting_pipeline/parity_analysis.ipynb). The dataframe displayed by the final cell is the output.
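
As a rough illustration of the test itself, the sketch below implements a one-way permutation ANOVA with the standard library. It is not the notebook's code: the function names, the use of the F statistic as the test statistic, the add-one correction, and the fixed seed are assumptions. Each group is a list of per-datapoint errors for one subgroup, and the data must have nonzero within-group variance.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def permutation_anova(groups, n_permutations=5000, seed=0):
    """Return (observed F, p-value) by shuffling group labels.

    The p-value is the share of shuffled labelings with F >= observed F.
    """
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        exceed += f_statistic(shuffled) >= observed
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (exceed + 1) / (n_permutations + 1)
```

A small p-value for a category would indicate that the judge's error differs across that category's subgroups by more than label shuffling alone would explain.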


# Beyond

For running on your own dataset, see [README_extended.md](./README_extended.md).

# Citation
If you use this repository or dataset, please cite the paper:

```bibtex
@inproceedings{poole-dayan2026benchmarking,
author = {Poole-Dayan, Elinor and Wu, Jiayi and Sorensen, Taylor and Pei, Jiaxin and Bakker, Michiel A.},
title = {Benchmarking Overton Pluralism in LLMs},
booktitle = {The Fourteenth International Conference on Learning Representations (ICLR)},
year = {2026},
month = apr,
url = {https://arxiv.org/abs/2512.01351}
}
```


# License

This project is released under the MIT License. See the [`LICENSE`](./LICENSE) file for details.
