# Software

Code for building the perturbations, running model inference, and scoring
the outputs. The benchmark JSONLs live in the separate data bundle.

## Layout

- `perturbation_construction/` — pipelines that build each of the three local
  edit types (number, symbol, step deletion) from the source datasets
  (miniF2F, MATH-500).
- `inference/` — model inference and the TC + StmtSC + ProofSC + FR/RR/OUR
  scoring stack.
- `prompts/` — the natural-language prompts used for inference, for
  generating the global perturbation rewrites, and for the two LLM judges.

## Requirements

- Python 3.10+
- `vllm` for the open-weight models
- Google Gemini SDK (`google-genai`) for Gemini-3.1-Pro inference and for
  the Gemini-2.5-Flash judge
- Lean 4.15.0 with Mathlib v4.15.0, driven through
  `leanprover-community/repl`

## How to reproduce

1. **Build the benchmark** (or skip and use the data bundle):
   ```
   bash perturbation_construction/number_edit/run_number_edit_pipeline.sh
   bash perturbation_construction/symbol_edit/run_symbol_edit_pipeline.sh
   # step deletion has no single shell wrapper; run Phases 1-3 in
   # perturbation_construction/step_edit/ (see the file reference below)
   ```
   Each pipeline writes the `*_unsound.jsonl` files that ship in the data bundle.

2. **Run inference** per model, e.g.
   ```
   python inference/llm_inference/gpu_inference_Kimina-Prover-RL-1-7B.py \
       --input <BENCHMARK>.jsonl --output <MODEL>_<TAG>_output.jsonl
   ```
   ProofFlow uses `inference/llm_inference/proofflow_inference.py`. Our
   Slurm/K8s launchers are not included; wrap the Python commands in
   whatever scheduler you use.

3. **Score the outputs**:
   ```
   python inference/llm_inference/supplement_tc_sc_v3.py \
       --input <MODEL>_<TAG>_output.jsonl \
       --output <MODEL>_<TAG>_tcsc_v3.jsonl
   ```
   This runs the Lean type-check (TC) and the SC v3 Decoupled judge
   (StmtSC + ProofSC). For local-edit experiments it also produces the
   FR/RR/OUR verdict.
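
Steps 2 and 3 chain naturally, since the scorer reads the file that inference writes. The sketch below shows one way to build both commands from Python before handing them to a scheduler. The helper names (`build_commands`, `run_all`) and the `out_dir` argument are illustrative and not part of this repository; only the script paths and the `--input` / `--output` flags come from the commands above.

```python
import shlex
import subprocess
from pathlib import Path

# Script paths as given in steps 2 and 3 above.
INFER = "inference/llm_inference/gpu_inference_Kimina-Prover-RL-1-7B.py"
SCORE = "inference/llm_inference/supplement_tc_sc_v3.py"


def build_commands(benchmark: str, model: str, tag: str, out_dir: str = ".") -> list[list[str]]:
    """Return the inference command followed by the scoring command."""
    out = Path(out_dir)
    raw = str(out / f"{model}_{tag}_output.jsonl")
    scored = str(out / f"{model}_{tag}_tcsc_v3.jsonl")
    return [
        ["python", INFER, "--input", benchmark, "--output", raw],
        ["python", SCORE, "--input", raw, "--output", scored],
    ]


def run_all(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `run_all(build_commands("<BENCHMARK>.jsonl", "<MODEL>", "<TAG>"))` prints the two commands without running them, which is convenient for pasting into a job script.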

---

## File reference

### `perturbation_construction/number_edit/`

The pipeline runs in four stages (1, 1.5, 2, and 3), plus scoring helpers.

| File | What it does |
|---|---|
| `label_numeric_roles.py` | **Stage 1.** Gemini-2.5-Flash labels every editable numeric literal in each problem (statement and proof), returning role tags and character offsets. |
| `fix_offsets.py` | **Stage 1.5.** Repairs Gemini's character offsets by re-anchoring the labelled value against the original NL via context matching (Gemini's offsets are often off due to Unicode / LaTeX escaping). |
| `select_candidates.py` | **Stage 2.** Filters the labelled numbers (rule-based: skip subscripts, indices, LaTeX format arguments) and picks one per region (statement or proof). |
| `build_number_edit_unsound.py` | **Stage 3.** Applies a SHA256-seeded deterministic rule to set the new value (sign-preserving, magnitude-proportional), then writes `statement_edit_unsound.jsonl` and `proof_edit_unsound.jsonl`. |
| `score_number_edit_region.py` | Region-restricted FR/RR/OUR scorer for number edits; feeds the judge only the Lean region targeted by the edit. |
| `aggregate_number_edit_results.py` | Aggregates per-sample scored outputs into final FR / RR / OUR rates per model and condition. |
| `lean_region.py` | Helper for extracting the signature / proof-body region of a Lean output. |
| `common.py`, `common_parser.py` | Shared utilities; the parser is robust to LaTeX-bearing JSON returned by LLMs. |
| `run_number_edit_pipeline.sh` | One-shot driver that runs stages 1 → 1.5 → 2 → 3. |
| `__init__.py` | Package marker. |
| `README.md` | Pipeline overview. |
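
To make the Stage 3 description concrete, here is a minimal sketch of what a SHA256-seeded, sign-preserving, magnitude-proportional rule could look like for integer literals. It is not the rule implemented in `build_number_edit_unsound.py`: the function name, the seed string, and the offset range (at most half the magnitude) are all assumptions chosen for illustration.

```python
import hashlib


def perturb_number(sample_id: str, value: int) -> int:
    """Deterministically pick a different integer with the same sign as `value`.

    Hypothetical rule: the offset is drawn from 1..max(1, |value| // 2),
    so larger numbers receive proportionally larger edits.
    """
    digest = hashlib.sha256(f"{sample_id}:{value}".encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")  # 64-bit seed from the hash

    magnitude = abs(value)
    max_offset = max(1, magnitude // 2)
    offset = 1 + seed % max_offset            # always at least 1

    # The top bit of the seed chooses whether to move up or down.
    direction = 1 if (seed >> 63) & 1 else -1
    new_magnitude = magnitude + direction * offset
    if new_magnitude <= 0:                    # would reach zero or flip the sign
        new_magnitude = magnitude + offset

    sign = -1 if value < 0 else 1
    return sign * new_magnitude
```

Because the seed depends only on the sample id and the original value, rebuilding the benchmark yields the same edited value every time, with no random state to persist.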

### `perturbation_construction/symbol_edit/`

Mirrors `number_edit/` for relation/operator edits. Each file plays the same
role as its `number_edit/` counterpart, with `numeric_roles` replaced by
`symbol_roles`. Instead of computing a new value, the build step swaps the
selected symbol according to a fixed map (`<↔>`, `≤↔≥`, `+↔−`, `×↔÷`):

`label_symbol_roles.py`, `fix_offsets.py`, `select_candidates.py`,
`build_symbol_edit_unsound.py`, `score_symbol_edit_region.py`,
`aggregate_symbol_edit_results.py`, `common.py`,
`run_symbol_edit_pipeline.sh`, `__init__.py`.
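
The two ideas that carry over from `number_edit/` are the offset repair by context matching and a deterministic substitution. The sketch below illustrates both for a single symbol. The function names, the `left_ctx` / `right_ctx` arguments, and the nearest-match tie rule are assumptions, not the interface of `fix_offsets.py` or `build_symbol_edit_unsound.py`; the swap table copies the map listed above character for character.

```python
# Fixed swap map from the list above (note: U+2212 minus, not ASCII '-').
SWAP = {
    "<": ">", ">": "<",
    "≤": "≥", "≥": "≤",
    "+": "−", "−": "+",
    "×": "÷", "÷": "×",
}


def reanchor(text: str, symbol: str, reported_offset: int,
             left_ctx: str, right_ctx: str):
    """Return the offset of `symbol` whose neighbours match the given context.

    If several occurrences match, prefer the one closest to the (possibly
    wrong) offset reported by the labeller. Returns None if nothing matches.
    """
    candidates = []
    start = text.find(symbol)
    while start != -1:
        end = start + len(symbol)
        if text[:start].endswith(left_ctx) and text[end:].startswith(right_ctx):
            candidates.append(start)
        start = text.find(symbol, start + 1)
    if not candidates:
        return None
    return min(candidates, key=lambda pos: abs(pos - reported_offset))


def swap_symbol(text: str, offset: int) -> str:
    """Replace the single-character symbol at `offset` using the fixed map."""
    return text[:offset] + SWAP[text[offset]] + text[offset + 1:]
```

For example, with `text = "Suppose a < b and c < d."`, a reported offset of 17, and context `("c ", " d")`, `reanchor` returns the position of the second `<`, and `swap_symbol` then turns the sentence into `"Suppose a < b and c > d."`.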

### `perturbation_construction/step_edit/`

| File | What it does |
|---|---|
| `label_proof_steps.py` | **Phase 1.** Gemini-2.5-Flash splits each proof into steps and each step into a reasoning part and an outcome, also flagging whether the step is deletable (rules R1–R4: substantive reasoning, standalone outcome, no anaphoric/dangling fragments, no bare numerics). |
| `clean_and_select.py` | **Phase 2.** Deterministic filter that re-checks the hard rules, sorts deletable steps into quality tiers (gold / silver / bronze), and picks the top step by a fixed ranking (tier → reasoning-substance score → outcome-readability score → reasoning length, with a SHA256 tiebreak). |
| `build_step_delete.py` | **Phase 3.** Applies the deletion (removes the chosen step's reasoning, keeps its stated outcome), requires the edited proof to be strictly shorter than the original and to differ in a single contiguous span; writes `step_delete_unsound.jsonl`. |
| `audit_step.py` | **Phase 4.** Independent re-audit of the final built data, re-running every hard rule and reporting pass/fail. |
| `score_step_delete.py` | **Phase 5.** Scorer: runs the FR/RR/OUR judge on Lean outputs against step-delete prompts. |
| `__init__.py` | Package marker. |
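
The Phase 3 constraints (strictly shorter, one contiguous differing span) can be checked mechanically. The following sketch assumes the edit is a pure deletion, which matches the description above but is not taken from `build_step_delete.py`; the function name is illustrative.

```python
def single_span_deletion(original: str, edited: str):
    """Return (start, end) such that original[:start] + original[end:] == edited.

    Returns None unless `edited` is strictly shorter than `original` and can
    be obtained by removing exactly one contiguous, non-empty span.
    """
    if len(edited) >= len(original):
        return None

    # Length of the longest common prefix.
    prefix = 0
    while prefix < len(edited) and original[prefix] == edited[prefix]:
        prefix += 1

    # Whatever remains of the edited text must be a suffix of the original.
    tail = edited[prefix:]
    if not original.endswith(tail):
        return None
    return prefix, len(original) - len(tail)
```

When the deleted text repeats characters at its boundary, more than one span can produce the same edited string; the function returns the one that starts latest, which is enough to confirm the constraint holds.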

### `inference/llm_inference/`

| File | What it does |
|---|---|
| `gpu_inference_Kimina-Prover-RL-1-7B.py` | vLLM inference driver. Shared by every open-weight model in the paper (Kimina-Prover-RL-1.7B, Kimina-Prover-Distill-8B, DeepSeek-Prover-V2-7B, Goedel-Prover-V2-8B, ProofBridge); pass a different model id and chat template. |
| `proofflow_inference.py` | Adapter that wraps the ProofFlow pipeline (graph model + formalizer + solver with Lean checks in the loop) to read our JSONL benchmark and write outputs in the same format as the other models. |
| `supplement_tc_sc_v3.py` | Main scoring driver. For each model output, runs the Lean type-check (TC) via the persistent REPL and calls the SC v3 judge (StmtSC + ProofSC); writes the augmented JSONL used to compute FullyCorrect. |
| `sc_combined_v3.py` | The SC v3 Decoupled judge. Two LLM calls per output: one judges statement equivalence using the NL statement + the Lean signature only; the other judges proof equivalence using the NL proof + the Lean proof body only. |
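
The decoupling amounts to showing each judge only its own half of the Lean output. Below is a rough sketch of that flow with the LLM calls abstracted as plain callables. Splitting at the first `:=` is a simplifying assumption (it would misfire on signatures that themselves contain `:=`), and `split_lean`, `decoupled_judge`, and both judge parameters are hypothetical names, not the API of `sc_combined_v3.py`.

```python
from typing import Callable

Judge = Callable[[str, str], bool]  # (natural-language text, Lean text) -> verdict


def split_lean(lean_src: str) -> tuple[str, str]:
    """Split a Lean theorem into (signature, proof body) at the first ':='."""
    signature, sep, body = lean_src.partition(":=")
    if not sep:
        raise ValueError("no ':=' found; cannot separate signature from proof")
    return signature.strip(), body.strip()


def decoupled_judge(nl_statement: str, nl_proof: str, lean_src: str,
                    stmt_judge: Judge, proof_judge: Judge) -> dict[str, bool]:
    """Run the two judgements independently, each on its own region."""
    signature, body = split_lean(lean_src)
    return {
        "StmtSC": stmt_judge(nl_statement, signature),  # proof body hidden
        "ProofSC": proof_judge(nl_proof, body),         # signature hidden
    }
```

Keeping the two calls separate means a mistranslated statement cannot be masked by a plausible-looking proof, and vice versa.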

### `inference/lean_interaction/`

| File | What it does |
|---|---|
| `checkLEAN.py` | Persistent Lean 4 REPL wrapper used by the TC scorer. Includes `sanitize_lean()` which strips leading model artifacts (e.g., a stray ``` `tactics\n` ``` line that some Kimina outputs ship) before sending to the REPL. |
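
As an illustration of the kind of cleanup `sanitize_lean()` performs, the sketch below drops leading artifact lines before the first real line of Lean. It is not the function from `checkLEAN.py`: the name carries a `_sketch` suffix to make that clear, and the set of lines treated as artifacts (blank lines, a bare `tactics` line, a leading Markdown fence line) is an assumption based on the example above.

```python
def sanitize_lean_sketch(src: str) -> str:
    """Strip leading non-Lean lines that a model may emit before the code.

    Hypothetical rule set: blank lines, a bare 'tactics' line, and any line
    that starts with a Markdown code fence are removed from the top only.
    """
    lines = src.splitlines()
    while lines:
        head = lines[0].strip()
        if head == "" or head == "tactics" or head.startswith("```"):
            lines.pop(0)
        else:
            break
    return "\n".join(lines)
```

Only the head of the output is touched, so a proof that legitimately contains the word `tactics` further down is left alone.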

### `prompts/`

| File | What it is |
|---|---|
| `kimina_inference_system.txt` | System prompt for autoformalization inference (the same prompt is used across all open-weight models for fair comparison). |
| `kimina_inference_user.txt` | User prompt that wraps the natural-language problem to be formalized. |
| `kimina_inference_examples.txt` | Few-shot examples appended to the inference prompt. |
| `variant_rephrase.txt` | Prompt that generates the free-form rewrites (G-FF, Q-FF). |
| `variant_step.txt` | Prompt that generates the step-by-step rewrites (G-Step, Q-Step). |
| `variant_faithfulness_rules.txt` | The shared rules block prepended to both rewrite prompts (no value/symbol/hypothesis/step changes, length within ±30% of the original). |
| `sc_v3_stmt.txt` | StmtSC judge prompt (judges the Lean theorem signature against the NL statement; the proof body is hidden). |
| `sc_v3_proof.txt` | ProofSC judge prompt (judges the Lean proof body against the NL proof; the signature is hidden and the judge is told to ignore whether the tactics compile). |
| `edit_judge_number_v2ctx.txt` | FR/RR/OUR judge prompt for number edits. |
| `edit_judge_symbol_v2ctx.txt` | FR/RR/OUR judge prompt for symbol edits. |
| `edit_judge_step_v2ctx.txt` | FR/RR/OUR judge prompt for step deletion. |
| `README.md` | Index of which prompt is used where. |
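
The ±30% length rule in `variant_faithfulness_rules.txt` is easy to verify after generation. The sketch below assumes length is counted in whitespace-separated words; the prompt may intend characters or tokens instead, and the function name is illustrative.

```python
def within_length_budget(original: str, rewrite: str, tolerance: float = 0.30) -> bool:
    """Check that the rewrite's word count is within ±tolerance of the original's."""
    n_orig = len(original.split())
    n_new = len(rewrite.split())
    if n_orig == 0:
        return n_new == 0
    return abs(n_new - n_orig) <= tolerance * n_orig
```

A rewrite that fails this check can be regenerated or dropped before it enters the benchmark.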
