# Reproducibility supplement

This folder contains small utilities and pinned dependencies for reproducing the paper’s reported tables, metrics, and figures from the released JSONL run logs.

## Requirements
- Python 3.10+ recommended

## Install

Create and activate a virtual environment, then install dependencies:

```bash
python -m venv .venv
```

**macOS / Linux**
```bash
source .venv/bin/activate
pip install -r repro_supplement/requirements.txt
```

**Windows (PowerShell)**
```powershell
.\.venv\Scripts\Activate.ps1
pip install -r repro_supplement\requirements.txt
```

**Windows (cmd.exe)**
```bat
.\.venv\Scripts\activate.bat
pip install -r repro_supplement\requirements.txt
```

## Summarize an eval run (paper metrics)

Compute the paper metrics/tables from one or more eval JSONL files:

```bash
python repro_supplement/summarize_results.py \
  --eval data/runs/eval_dev_....jsonl data/runs/eval_test_....jsonl \
  --label_kind paper \
  --out_dir results/
```

`--label_kind paper` uses **paper-final labels**:
- `human_label` if present
- otherwise `model_final_label`
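
The fallback rule above can be sketched in a few lines of Python. This is an illustration only, not the code in `summarize_results.py`. The helper name `paper_final_label` is hypothetical, and treating a missing key, `None`, or an empty string as "not present" is an assumption.

```python
from typing import Any, Optional


def paper_final_label(record: dict[str, Any]) -> Optional[Any]:
    """Return `human_label` when present, otherwise `model_final_label`."""
    human = record.get("human_label")
    # Assumption: a missing key, None, or "" all mean "no human label".
    if human is not None and human != "":
        return human
    # Returns None if neither field is set on the record.
    return record.get("model_final_label")
```

A record with both fields resolves to its `human_label`; a record with only `model_final_label` resolves to that value.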

## Merge human labels into an eval file (paper-final labels)

If you have a separate human-label JSONL, merge it into the eval log to create a single “paper-final” file:

```bash
python repro_supplement/merge_human_labels.py \
  --eval   data/runs/eval_test_....jsonl \
  --labels data/runs/human_labels_....jsonl \
  --out    data/runs/eval_test_...._with_human.jsonl
```
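
Conceptually, the merge is a keyed join between the two JSONL files. The sketch below shows one way such a join could work; it is not the implementation in `merge_human_labels.py`. The join key `item_id` and the helper names are hypothetical, so check the released logs for the actual identifier field before adapting it.

```python
import json
from pathlib import Path
from typing import Any, Iterable

KEY = "item_id"  # hypothetical join key; the real logs may use another field


def read_jsonl(path: Path) -> list[dict[str, Any]]:
    """Load one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def merge_labels(
    eval_records: list[dict[str, Any]],
    label_records: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Copy `human_label` onto eval records that share the join key."""
    by_key = {
        r[KEY]: r["human_label"]
        for r in label_records
        if KEY in r and "human_label" in r
    }
    merged = []
    for rec in eval_records:
        out = dict(rec)  # shallow copy so the input list is left untouched
        key = rec.get(KEY)
        if key is not None and key in by_key:
            out["human_label"] = by_key[key]
        merged.append(out)
    return merged
```

Eval records without a matching label pass through unchanged, so they fall back to `model_final_label` when summarized with `--label_kind paper`.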

Then summarize the merged file:

```bash
python repro_supplement/summarize_results.py \
  --eval data/runs/eval_test_...._with_human.jsonl \
  --label_kind paper \
  --out_dir results/
```

## Notes
- The scripts expect eval logs to contain (at minimum) `tutor_model`, `tutor_turn2`, `judge_a`, `judge_b`, `pressure_mode`, `domain`, and `confidence`.
- Files named like `human_packets_*.jsonl` are not eval logs and are not intended as direct input to `summarize_results.py`.
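
To confirm that a file is a usable eval log before summarizing it, a quick field check along these lines may help. It is a sketch based only on the field names listed in the notes above; the helper names are hypothetical, and the scripts themselves may validate input differently or require additional fields.

```python
import json
from pathlib import Path

# Minimum fields named in the notes above.
REQUIRED_FIELDS = (
    "tutor_model",
    "tutor_turn2",
    "judge_a",
    "judge_b",
    "pressure_mode",
    "domain",
    "confidence",
)


def missing_fields(record: dict) -> list[str]:
    """Return the required field names absent from one record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


def check_eval_log(path: Path) -> list[tuple[int, list[str]]]:
    """Return (line number, missing fields) for each incomplete record."""
    problems = []
    with path.open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            missing = missing_fields(json.loads(line))
            if missing:
                problems.append((lineno, missing))
    return problems
```

An empty result from `check_eval_log` means every record carries the minimum fields; a file in which most lines are flagged is probably not an eval log.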

