# Number Edit Experiment (Exp2)

Tests whether an NL→Lean autoformalizer faithfully transfers a small numeric
perturbation from the informal statement / proof into the formal proof.

**Authoritative spec**: see [docs/exp2_number_edit.md](../docs/exp2_number_edit.md).
This README is a quick orientation for the code layout only.

## Pipeline (6 stages)

```
label_numeric_roles.py         Stage 1 — Gemini lists all editable numeric literals per problem
    → select_candidates.py     Stage 2 — filter + sha256-seeded random pick per source
    → build_number_edit_unsound.py   Stage 3 — SHA256 deterministic perturbation + text rewrite
    → inference (NRP vLLM)     Stage 4 — Kimina-Prover-RL-1.7B {base, SFT}
    → score_number_edit_v2.py  Stage 5 — Gemini 2.5 Flash judge, FR/RR/OUR
    → aggregate_number_edit_results.py   Stage 6 — per-bucket summary
```

Output buckets: `statement_edit`, `proof_edit` (2 total).
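
Stage 2's "sha256-seeded random pick per source" can be illustrated with a minimal sketch. The function name `pick_one`, the use of the source id as the seed material, and the sorting step are assumptions for illustration; the actual logic lives in `select_candidates.py` and may differ.

```python
import hashlib
import random


def pick_one(source_id: str, candidates: list[str]) -> str:
    """Pick one candidate reproducibly for a given source (hypothetical helper)."""
    # Derive an integer seed from the SHA-256 digest of the source id,
    # so the same source always yields the same pick across runs.
    digest = hashlib.sha256(source_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    # Sort first so the result does not depend on input ordering.
    return rng.choice(sorted(candidates))
```

Seeding from a hash of the source id, rather than from a global seed, keeps each pick stable even if problems are added, removed, or reordered in the input file.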

## Files

| File | Role |
|---|---|
| `label_numeric_roles.py`         | Stage 1 — Gemini label prompt, emits `{problem_name, numbers: [...]}` |
| `select_candidates.py`           | Stage 2 — drop non-literals, then pick one at random per source (SHA256-seeded) |
| `build_number_edit_unsound.py`   | Stage 3 — perturb via SHA256, rewrite statement or proof |
| `common.py`                      | `load_jsonl/write_jsonl`, `perturb_numeric_string` (SHA256 seeded, magnitude-proportional), `replace_span` with LaTeX-fraction fallback |
| `common_parser.py`               | `robust_json_loads` — handles Gemini's LaTeX-backslash JSON quirks |
| `score_number_edit_v2.py`        | Stage 5 — Gemini 2.5 Flash scorer, FR/RR/OUR |
| `aggregate_number_edit_results.py` | Stage 6 — per-edit-type FR/RR/OUR table |
| `run_number_edit_pipeline.sh`    | Local end-to-end driver (label → build → local vLLM infer → score → aggregate) |
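
As a rough illustration of what "SHA256 seeded, magnitude-proportional" perturbation could look like, here is a minimal sketch for integers only. The name `perturb_int`, the 10% bound, and the seed string format are assumptions; the real `perturb_numeric_string` in `common.py` operates on strings and its exact rules are defined in the spec.

```python
import hashlib


def perturb_int(problem_name: str, value: int) -> int:
    """Deterministically shift an integer by a small, size-dependent amount.

    Hypothetical sketch, not the repository's perturb_numeric_string.
    """
    # Hash the problem name together with the value: same input, same output.
    digest = hashlib.sha256(f"{problem_name}:{value}".encode("utf-8")).digest()
    # Assumed bound: the shift is at most 10% of the magnitude, and at least 1.
    max_delta = max(1, abs(value) // 10)
    delta = 1 + digest[0] % max_delta  # always in 1..max_delta, never 0
    sign = 1 if digest[1] % 2 == 0 else -1
    return value + sign * delta
```

Because `delta` is never zero, the edited value always differs from the original, which is what lets the scorer distinguish a faithful output from a reverted one.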

## Metrics

All metrics are reported by `aggregate_number_edit_results.py`:

- **FR** (Faithful Rate) — model's Lean output contains the new (edited) numeric value
- **RR** (Reversion Rate) — model's Lean output contains the old (original) value instead
- **OUR** (Other/Unclear Rate) — neither; empty / error / unrelated output

`FR + RR + OUR = 1`. Gemini 2.5 Flash is the judge; it handles equivalences like `6.5 ↔ 13/2`.
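
The three rates can be computed from per-example judge verdicts along these lines. The label strings `"faithful"` and `"reverted"` and the function name `rates` are assumptions for illustration; the actual scored JSONL schema is defined by `score_number_edit_v2.py`.

```python
from collections import Counter


def rates(verdicts: list[str]) -> dict[str, float]:
    """Compute FR/RR/OUR from a list of judge verdicts (hypothetical labels)."""
    total = len(verdicts)
    if total == 0:
        return {"FR": 0.0, "RR": 0.0, "OUR": 0.0}
    counts = Counter(verdicts)
    faithful = counts["faithful"]
    reverted = counts["reverted"]
    # Anything that is neither faithful nor reverted falls into OUR,
    # which is what guarantees the three rates sum to 1.
    other = total - faithful - reverted
    return {"FR": faithful / total, "RR": reverted / total, "OUR": other / total}
```

Treating OUR as the remainder, rather than counting a specific label, keeps the identity `FR + RR + OUR = 1` intact even when the judge returns an empty or malformed verdict.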

## Minimal local run

```bash
export GOOGLE_API_KEY=...
bash number_edit/run_number_edit_pipeline.sh
```

## Minimal manual steps

```bash
# Stage 1: label
python -m number_edit.label_numeric_roles \
    --input ./datasets_validation/minif2f/dataset.jsonl \
    --output ./number_edit/data/labeled_numbers.jsonl

# Stage 2: select
python -m number_edit.select_candidates \
    --input ./number_edit/data/labeled_numbers.jsonl \
    --output ./number_edit/data/selected_numbers.jsonl

# Stage 3: build
python -m number_edit.build_number_edit_unsound \
    --input ./datasets_validation/minif2f/dataset.jsonl \
    --labels ./number_edit/data/selected_numbers.jsonl \
    --output_dir ./number_edit/data

# Stage 4: inference on NRP (see cluster/nrp_inference/)

# Stage 5: score
python -m number_edit.score_number_edit_v2 --all

# Stage 6: aggregate
python -m number_edit.aggregate_number_edit_results \
    --input ./number_edit/data/scored/statement_edit.scored.jsonl
python -m number_edit.aggregate_number_edit_results \
    --input ./number_edit/data/scored/proof_edit.scored.jsonl
```
