# Running with your own dataset

This guide explains how to run the Overton benchmark and LLM pipeline on **your own data**. For standard reproduction using the OvertonBench dataset from Hugging Face, see [README.md](README.md).

**Overview:** If your dataset matches the expected schema (below), you can use it throughout the pipeline by setting `DATASET` in `.env` or passing `--data path/to/your.csv` to the benchmark and prediction scripts. The same data is used for human Overton scores, LLM-as-a-judge predictions, baselines, and (with the right columns) LOMO generalization and parity analysis.

---

## Dataset schema: what your CSV should look like

Your CSV should follow the same structure as [OvertonBench](https://huggingface.co/datasets/elinorpd/overtonbench). Each row is one **human rating** of one **model response** to one **question** by one **rater**.

### Required columns (all pipelines)

| Column                     | Description |
| -------------------------- | ----------- |
| `user`                     | Unique identifier for the rater/annotator. |
| `question_id`              | Unique identifier for the question. |
| `model`                    | Name or identifier of the model being rated. |
| `representation_rating`    | Numeric rating (e.g. 1–5 Likert) for that user/model/question. |
| **Cluster column**         | A column with cluster assignments per (user, question). The codebase default is `cluster_kmeans`; you can use another name and pass `--cluster_col your_col` to the benchmark and generalization scripts. |

The **benchmark** (`benchmark_overton_pipeline.py`) and **generalization** (`lomo_generalization_metrics.py`) scripts require the cluster column. Rows with missing cluster labels are dropped. When using your own data, you need to assign clusters yourself (e.g. via your own clustering or methodology) and pass the column name to the relevant scripts with `--cluster_col`.
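Because rows without a cluster label are dropped, it can help to preview which rows would survive before running the benchmark. The following is a minimal sketch, not the repository's implementation: the helper `rows_with_cluster`, the sample data, and the definition of "missing" (an empty cell or the literal string `nan`) are all assumptions made for illustration.

```python
import csv
import io


def rows_with_cluster(csv_file, cluster_col="cluster_kmeans"):
    """Yield rows whose cluster label is present.

    Assumption: "missing" means an empty cell or the literal string "nan".
    """
    for row in csv.DictReader(csv_file):
        label = (row.get(cluster_col) or "").strip()
        if label and label.lower() != "nan":
            yield row


# Hypothetical in-memory CSV standing in for path/to/your.csv,
# using a custom cluster column name.
sample = io.StringIO(
    "user,question_id,model,representation_rating,my_cluster\n"
    "u1,q1,model_a,4,0\n"
    "u2,q1,model_a,2,\n"
    "u3,q1,model_a,5,1\n"
)
kept = list(rows_with_cluster(sample, cluster_col="my_cluster"))
print([row["user"] for row in kept])  # the row for u2 has no cluster label
```

If many rows are filtered out here, the benchmark will run on a correspondingly smaller dataset.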

The k-means clustering pipeline used in the paper is not included here because it depends heavily on the specific format of the human voting data. We document our approach, which is based on [Small et al. (2021)](https://doi.org/10.6035/recerca.5516), in [Appendix C](https://arxiv.org/pdf/2512.01351#appendix.C), and we are happy to share the implementation upon request. If you are interested, please contact the first author at elinorpd [at] mit [dot] edu.

### Required for LLM predictions and baselines

| Column          | Description |
| --------------- | ----------- |
| `question`      | The question text. |
| `llm_response`  | The model-generated response being rated. |

These are used by `prediction.py`, `semantic_baseline.py`, and by the few-shot logic that looks up other ratings for the same user/question.

### Optional (for full prompt variants and parity)

- **Free-response / few-shot:** `freeresponse` — the user’s written opinion on the question. Used by `fr`, `fs`, and `fs+fr` prompts.
- **Demographics (prompts that use them):**  
  The code looks for: `Age` or `age`, `Sex` or `gender`, `Ethnicity simplified` or `race`, `U.s. political affiliation` or `political`. If a column is missing, the corresponding prompt field receives an empty string.
- **Parity notebook:** The subgroup parity analysis uses columns such as `Sex`, `Ethnicity simplified`, `U.s. political affiliation`, `selection_position`, and `model`. For your own data, you can change the notebook’s `CATEGORIES` list to match your column names.
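The demographic lookup described above can be pictured as an alias table with an empty-string fallback. This is only an illustrative sketch: the function `demographic_fields`, the prompt field names used as dictionary keys, and the choice to try the first alias before the second are assumptions, not the code in `prompts.py`.

```python
# Alias pairs as listed above; the keys (prompt field names) are hypothetical.
DEMOGRAPHIC_ALIASES = {
    "age": ("Age", "age"),
    "sex": ("Sex", "gender"),
    "ethnicity": ("Ethnicity simplified", "race"),
    "political": ("U.s. political affiliation", "political"),
}


def demographic_fields(row):
    """Map one CSV row (a dict) to prompt fields, using "" when no alias is present."""
    fields = {}
    for field, aliases in DEMOGRAPHIC_ALIASES.items():
        value = ""
        for name in aliases:  # assumed precedence: first alias wins
            if row.get(name) not in (None, ""):
                value = str(row[name])
                break
        fields[field] = value
    return fields


# Hypothetical row that mixes both naming styles and omits political affiliation.
row = {"age": "34", "Sex": "Female", "race": "Asian"}
fields = demographic_fields(row)
print(fields)
```

In this sketch, a dataset that lacks a demographic column still produces a complete set of prompt fields, with the missing ones left blank.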

### Minimal example (benchmark + predictions)

For the **human benchmark** only, you need at least:  
`user`, `question_id`, `model`, `representation_rating`, and a cluster column (e.g. `cluster_kmeans`).

For **LLM predictions** and **baselines**, you also need:  
`question`, `llm_response`.  
Adding `freeresponse` enables the best-performing prompt (`fs+fr`) and other variants that use it.
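Before running anything, you may want to confirm that your CSV header contains the columns listed above. The sketch below uses only the standard library; the helper names `missing_columns` and `read_header` and the sample data are hypothetical and not part of the repository.

```python
import csv
import io

# Column names taken from the schema tables above.
BENCHMARK_COLUMNS = ["user", "question_id", "model", "representation_rating"]
PREDICTION_COLUMNS = ["question", "llm_response"]


def missing_columns(header, cluster_col="cluster_kmeans", need_predictions=False):
    """Return the required columns that are absent from a CSV header."""
    required = BENCHMARK_COLUMNS + [cluster_col]
    if need_predictions:
        required = required + PREDICTION_COLUMNS
    present = set(header)
    return [col for col in required if col not in present]


def read_header(csv_file):
    """Read only the first row of an open CSV file."""
    return next(csv.reader(csv_file), [])


# Hypothetical in-memory CSV standing in for path/to/your.csv.
sample = io.StringIO(
    "user,question_id,model,representation_rating,cluster_kmeans\n"
    "u1,q1,model_a,4,0\n"
)
header = read_header(sample)
print(missing_columns(header))                         # benchmark columns only
print(missing_columns(header, need_predictions=True))  # also checks question, llm_response
```

For a real file, open it with `open("path/to/your.csv", newline="")` and pass the handle to `read_header`.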

---

## How to use your data

Run all commands from the repository root.

- **Environment:** Set `DATASET=path/to/your.csv` in `.env` (optional).  
- **CLI override:** Pass `--data path/to/your.csv` to any script that supports it; this overrides `DATASET` for that run.
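The precedence between the two settings can be sketched as follows. This is a hedged illustration rather than the scripts' actual argument handling: the function `resolve_dataset` is hypothetical, and it assumes the values from `.env` have already been loaded into the process environment.

```python
import argparse
import os


def resolve_dataset(argv, environ=None):
    """Return the dataset path: --data if given, else DATASET, else None."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", default=None)
    # Ignore any other flags; this sketch only cares about --data.
    args, _unknown = parser.parse_known_args(argv)
    if args.data:
        return args.data
    return environ.get("DATASET") or None


# Hypothetical paths for illustration.
env = {"DATASET": "data/from_env.csv"}
print(resolve_dataset(["--data", "data/from_cli.csv"], env))  # CLI value wins
print(resolve_dataset([], env))                               # falls back to DATASET
print(resolve_dataset([], {}))                                # neither set
```

When neither value is set, the sketch returns `None`, which corresponds to the pipeline's default of loading from Hugging Face.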

Scripts that accept `--data` and/or `--source` arguments:

- `benchmark_overton_pipeline.py` — `--data`; cluster column via `--cluster_col` (default `cluster_kmeans`).
- `prediction.py` — `--data`; `--source` is only for Hugging Face splits (e.g. `modelslant`, `prism`), not for custom CSV.
- `semantic_baseline.py` — `--data`, optional `--n_rows` for sampling.
- `lomo_generalization_metrics.py` — `--data` for human data CSV; `--source` for HF split; `--preds_csv` and `--pred_col` for your prediction file.

When using a **custom CSV** (not Hugging Face), prediction and baseline outputs do not get a `_modelslant`/`_prism` suffix. Predictions use the default naming (e.g. `gemini_all_rows_fr+fs.csv`), and baselines are written to `baselines_rounded_custom.csv` when `--data` is set.

---

## 1. Human benchmark

Use your own file by setting `DATASET` in `.env` or passing `--data path/to/your.csv`. Your CSV must include:

- `user`, `question_id`, `model`, `representation_rating`
- A cluster assignment column (default name: `cluster_kmeans`; override with `--cluster_col`)

Example:

```bash
python src/benchmark_overton_pipeline.py \
  --data path/to/my_data.csv \
  --cluster_col cluster_kmeans \
  --weighted
```

**Options:**  
`--tau` (default 4.0), `--outdir` (default `outputs/`), `--weighted`, `--emit-oc-per-question`, `--upper-bound`. See script help or [README.md](README.md) for outputs.

---

## 2. LLM predictions

By default the pipeline loads from Hugging Face. To use your own CSV: set `DATASET` in `.env` or pass `--data path/to/your.csv`.

- **Model:** Configure API keys in `.env` and add your model in `src/prompting_pipeline/llm_api.py` if needed.
- **Prompts:** Templates and fields are in `src/prompting_pipeline/prompts.py`; the prompt map is in `src/prompting_pipeline/prediction.py`. Best-performing variant in the paper: **fs+fr** (few-shot + free-response).

Example (single prompt, Gemini):

```bash
python src/prompting_pipeline/prediction.py \
  --client gemini \
  --prompt fs+fr \
  --max_workers 8
```

With your own data:

```bash
python src/prompting_pipeline/prediction.py \
  --client gemini \
  --prompt fs+fr \
  --data path/to/your.csv \
  --max_workers 8
```

To run multiple prompt types in one go, use `--prompts` (e.g. `--prompts fr fs fs+fr`). Optional: `--n_rows N` to run on a random sample of N rows.

Results are saved under `outputs/predictions/` with filenames derived from client, row count, prompt, and (when using HF) source split.

---

## 3. Baselines

Run semantic similarity and mean-of-others baselines:

```bash
python src/prompting_pipeline/semantic_baseline.py
```

With your own data: `--data path/to/your.csv`. Optional: `--source modelslant` or `--source prism` for HF splits; `--n_rows N` for a sample. Output: `outputs/predictions/baselines_rounded.csv` (or `baselines_rounded_custom.csv` when using `--data`, plus `_modelslant`/`_prism`/`_N` as applicable).

---

## 4. LLM evaluation and analysis

- **LOMO generalization:** Run `lomo_generalization_metrics.py` with `--source modelslant` (or your HF split) when using OvertonBench. For custom data, pass `--data path/to/your.csv` and `--preds_csv` / `--pred_col` to point to your prediction file.

- **Primary metrics** and **subgroup parity** use the notebooks below. Each notebook has a cell at the top where you can adjust the variables when using your own prediction files or column names.

### Primary eval notebook

**File:** `src/prompting_pipeline/primary_eval.ipynb`

**Config (second code cell at top of notebook):**

| Variable | Description |
| -------- | ----------- |
| `PREDICTIONS_DIR` | Directory containing prediction CSVs (default: `../../outputs/predictions`). |
| `FILE_FR`, `FILE_FR_FS`, `FILE_FS` | Filenames for the three Gemini prompt outputs (fr, fr+fs, fs). |
| `BASELINES_FILE` | Filename for the baselines CSV (default: `baselines_rounded_modelslant.csv` for ModelSlant paper repro; use `baselines_rounded.csv` for full split). |
| `GOLD_COL` | Ground-truth rating column (default: `representation_rating`). |
| `PRED_COLS` | List of prediction columns to evaluate (default: `gemini_fr_avg`, `gemini_fr+fs_avg`, `gemini_fs_avg`, `sem_sim_avg`, `mean_of_others_avg`). |

For custom runs (e.g. different split or your own predictions), set `PREDICTIONS_DIR` and the file names to match your outputs. If your prediction CSVs use different column names, update `GOLD_COL` and `PRED_COLS`. You may also need to extend the `prompt_info` dict in the notebook for display labels when adding new prediction columns.

### Parity analysis notebook

**File:** `src/prompting_pipeline/parity_analysis.ipynb`

**Config (first code cell):**

| Variable | Description |
| -------- | ----------- |
| `PREDICTIONS_PATH` | Path to the prediction CSV (default: `../../outputs/predictions/gemini_all_rows_fr+fs_modelslant.csv`). |
| `PRED_COL` | Name of the prediction column in that CSV (default: `gemini_fr+fs_avg`). |
| `GOLD_COL` | Ground-truth rating column (default: `representation_rating`). |
| `CATEGORIES` | List of column names used for subgroup parity (default: `Sex`, `Ethnicity simplified`, `U.s. political affiliation`, `selection_position`, `model`). |

For your own data, point `PREDICTIONS_PATH` at your prediction file and set `PRED_COL` and `GOLD_COL` to match your schema. Set `CATEGORIES` to the column names in your CSV that define the subgroups you want to test (e.g. demographics, model, or other categorical variables).
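As a rough illustration, a config cell adapted for custom data might look like the sketch below. Only the variable names come from the table above. The subgroup columns `age_group` and `region` and the helper `missing_from_header` are hypothetical, and the prediction filename assumes the default custom-CSV naming mentioned earlier in this guide.

```python
# Hypothetical values for a custom run; adjust to match your own outputs.
PREDICTIONS_PATH = "../../outputs/predictions/gemini_all_rows_fr+fs.csv"
PRED_COL = "gemini_fr+fs_avg"
GOLD_COL = "representation_rating"
CATEGORIES = ["age_group", "region", "model"]  # hypothetical subgroup columns


def missing_from_header(header):
    """Return configured columns that do not appear in the prediction CSV header."""
    needed = [PRED_COL, GOLD_COL, *CATEGORIES]
    return [col for col in needed if col not in header]


# Hypothetical header of a prediction CSV.
example_header = ["user", "question_id", "model", "representation_rating",
                  "gemini_fr+fs_avg", "age_group"]
print(missing_from_header(example_header))  # lists any configured column not found
```

Running a check like this before the analysis cells makes a mismatched column name easier to spot than a later `KeyError`.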