# Experiment 7: Cross-Judge Analysis

This experiment regrades experiment 1 results with multiple judge models to validate
the robustness of ESR findings across different evaluators.

## Overview

The script samples trials from the experiment 1 results, prioritizing self-correction
examples, and regrades them with multiple LLM judges to check for:
1. Inter-judge agreement on scores
2. Consistency of multi-attempt detection
3. Robustness of ESR metrics across judges
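
The sampling step described above could be sketched as follows. This is a minimal illustration, not the logic in `run_cross_judge.py`: the `self_corrected` field and the `sample_trials` helper are hypothetical names, and the real result schema may differ.

```python
import random


def sample_trials(trials, n_samples, seed=0):
    """Return up to n_samples trials, taking self-correction examples first.

    `trials` is a list of dicts. The boolean `self_corrected` key is an
    assumed field name used only for this sketch.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    priority = [t for t in trials if t.get("self_corrected")]
    rest = [t for t in trials if not t.get("self_corrected")]
    # Shuffle within each group so truncation does not favor file order.
    rng.shuffle(priority)
    rng.shuffle(rest)
    # Self-correction examples fill the sample before any other trial.
    return (priority + rest)[:n_samples]
```

With this approach, a sample smaller than the number of self-correction trials contains only self-correction trials, so any metric computed on it describes that subset rather than experiment 1 as a whole.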

## Judge Models

The following alternative judges are used:
- GPT-5-Mini (via OpenRouter)
- Qwen3-32B (via OpenRouter)
- Gemini 2.5 Flash (via Google API)

Note: Haiku 4.5 is excluded because it was the original judge, and the goal is independent validation.
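
Once several judges have graded the same trials, agreement can be summarized per judge pair. The sketch below is one possible approach, not necessarily what this experiment computes: it uses mean absolute difference for scores and Cohen's kappa for the binary multi-attempt flag. The function names and the input layout (judge name mapped to a list aligned by trial) are assumptions.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary (0/1) labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance from each judge's marginal rates.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both judges gave the same constant label
    return (observed - expected) / (1 - expected)


def mean_abs_diff(a, b):
    """Mean absolute difference between two aligned lists of scores."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def pairwise_agreement(scores, flags):
    """Compare every pair of judges.

    scores: {judge_name: [score per trial]}
    flags:  {judge_name: [0/1 multi-attempt flag per trial]}
    Both layouts are hypothetical and assume lists aligned by trial.
    """
    out = {}
    for j1, j2 in combinations(sorted(scores), 2):
        out[(j1, j2)] = {
            "score_mad": mean_abs_diff(scores[j1], scores[j2]),
            "multi_attempt_kappa": cohen_kappa(flags[j1], flags[j2]),
        }
    return out
```

Kappa is used for the flag rather than raw agreement because multi-attempt responses may be rare, in which case two judges that almost always answer "no" would agree often by chance alone.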

## Usage

```bash
# Dry run to see what would be done
python run_cross_judge.py --dry-run --n-samples 1000

# Full run
python run_cross_judge.py --n-samples 1000
```

## Output

Results are saved to:
- `experiment_results/claude_haiku_4_5_20251001_judge/cross_judge_results/`

Plots are generated by `plotting/plot_exp7.py`.
