# FormInv: Supplementary Material
## Anonymous submission to AI4Math @ ICML 2026

This package contains machine-verifiable evidence for the two primary claims of the paper.


## Contents

### 1. Lean 4 Proof Certificates (`lean_proofs/`)

**Verifying the no-Mathlib proofs (Lean 4.29.1 required):**

```bash
lean lean_proofs/FormInvCertificates.lean
```

Expected output: none. The command exits silently with zero errors when every theorem compiles.

**Verifying the Mathlib proofs (Mathlib v4.29.0 required):**

```bash
cd lean_proofs/mathlib_proofs
lake exe cache get    # downloads precompiled Mathlib (~3 GB, one-time)
lake build FormInvMathlib
```

Expected output: exit code 0.

**Summary of theorems:**

| Theorem name | Type | Claim verified |
|---|---|---|
| f5\_err1\_dvd\_add\_biconditional\_is\_false | T3 disproof | Biconditional overreach of Nat.dvd\_add is FALSE (counterexample: a=2, m=1, n=1) |
| f3\_err1\_zero\_divides\_one\_is\_false | T3 disproof | Passive-voice inversion of Nat.dvd\_zero is FALSE (0 does not divide 1) |
| f6\_ge\_iff\_le\_cert | T1 certificate | Comparison-order transform (a>=b iff b<=a) is definitionally equivalent |
| f4\_nat\_mod\_self | T1 certificate | Notation equivalence (n mod n = 0) holds |
| mathcheck\_g38\_malformed\_paraphrase\_is\_wrong | External disproof | MathCheck Group 38: standard parse yields 998, not 498 |
| mathcheck\_g26\_times\_more\_neq\_times\_as\_old | External disproof | "4 times more" differs from "4 times as old" for every b > 0 |

The remaining 12 theorems across the two files certify additional T1 equivalence families (F6, F7, F8) and T3 disproofs. See the file-level comments for details.
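
To give a sense of what the two kinds of certificate look like, the following is a minimal Lean 4 sketch without Mathlib. It is illustrative only: the statements are our own restatements of four table rows, the reading of "biconditional overreach" as `a ∣ m + n ↔ (a ∣ m ∧ a ∣ n)` is an assumption, and the names and exact formulations in `FormInvCertificates.lean` may differ.

```lean
-- Illustrative restatements only; not copied from FormInvCertificates.lean.

-- T1-style certificate: `a ≥ b` unfolds to `b ≤ a`, so `Iff.rfl` suffices.
example (a b : Nat) : a ≥ b ↔ b ≤ a := Iff.rfl

-- T1-style certificate: n mod n = 0, via the core lemma `Nat.mod_self`.
example (n : Nat) : n % n = 0 := Nat.mod_self n

-- T3-style disproof: 0 does not divide 1.
example : ¬ ((0 : Nat) ∣ 1) := by
  intro h
  obtain ⟨k, hk⟩ := h   -- hk : 1 = 0 * k
  omega

-- T3-style disproof (assumed form of the biconditional):
-- it fails at a = 2, m = 1, n = 1, since 2 ∣ 1 + 1 but 2 does not divide 1.
example : ¬ (∀ a m n : Nat, a ∣ m + n ↔ (a ∣ m ∧ a ∣ n)) := by
  intro h
  have h2 : (2 : Nat) ∣ 1 + 1 := ⟨1, rfl⟩
  obtain ⟨k, hk⟩ := ((h 2 1 1).mp h2).1   -- hk : 1 = 2 * k
  omega
```

A T1 certificate proves that a paraphrase is equivalent to the canonical statement. A T3 disproof proves the negation of the paraphrased statement, usually from a concrete counterexample.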



### 2. MathCheck Ranking Change Evidence (`data/`)

**`mathcheck_129_ranking_report.md`:** Full narrative description of the 129-group MathCheck audit. Includes per-group results, flagged items, and model rankings with and without the 4 detected bad paraphrases.

**`mathcheck_129_results_summary.json`:** Machine-readable summary. Key fields:

```json
{
  "n_groups": 129,
  "n_bad": 4,
  "error_rate": 0.031,
  "flagged_groups": ["25", "27", "75", "82"],
  "ranking_with_bad":    ["claude-sonnet-4-6", "gpt-4o", "claude-haiku-4-5", "deepseek-chat"],
  "ranking_without_bad": ["claude-sonnet-4-6", "claude-haiku-4-5", "deepseek-chat", "gpt-4o"]
}
```
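
The summary can be sanity-checked in a few lines. The sketch below parses an inline copy of the key fields shown above (so it runs without the package), checks that `error_rate` agrees with `n_bad / n_groups`, and lists which models change rank. The helper `rank_shifts` is a hypothetical name introduced for illustration and is not part of the `forminv` package.

```python
import json

# Inline copy of the key fields shown above. To use the real file instead:
# summary = json.load(open("data/mathcheck_129_results_summary.json"))
summary = json.loads("""
{
  "n_groups": 129,
  "n_bad": 4,
  "error_rate": 0.031,
  "flagged_groups": ["25", "27", "75", "82"],
  "ranking_with_bad":    ["claude-sonnet-4-6", "gpt-4o", "claude-haiku-4-5", "deepseek-chat"],
  "ranking_without_bad": ["claude-sonnet-4-6", "claude-haiku-4-5", "deepseek-chat", "gpt-4o"]
}
""")


def rank_shifts(before, after):
    """Map each model to (old_rank, new_rank), both 1-indexed."""
    return {m: (before.index(m) + 1, after.index(m) + 1) for m in before}


# error_rate should equal n_bad / n_groups rounded to three decimals.
rate_ok = round(summary["n_bad"] / summary["n_groups"], 3) == summary["error_rate"]
# There should be one flagged group ID per bad paraphrase.
count_ok = len(summary["flagged_groups"]) == summary["n_bad"]

shifts = rank_shifts(summary["ranking_with_bad"], summary["ranking_without_bad"])
moved = {m: r for m, r in shifts.items() if r[0] != r[1]}

print("error_rate consistent:", rate_ok)
print("flagged count consistent:", count_ok)
for model, (old, new) in moved.items():
    print(f"{model}: rank {old} -> {new}")
```

With the values shown, removing the flagged groups moves `gpt-4o` from second to fourth place, while `claude-sonnet-4-6` stays first.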



### 3. Primary Evaluation Dataset (`data/`)

**`forminv_v3_50.jsonl`:** 366 evaluation items (50 theorems, avg. 7.3 paraphrase families per theorem). Each line is a JSON object with fields: `id`, `theorem_id`, `lean4_statement`, `mathlib_name`, `domain`, `tier`, `family`, `nl_question`, `ground_truth`, `canonical_nl`, `verification_method`.

Ground truth is `TRUE` for all 50 canonical Mathlib4 theorems.

**`false_controls_pilot.jsonl`:** 25 FALSE-ground-truth items (5 theorems, 5 families each). Each item has a verified counterexample documented in the `lean4_statement` field as a comment.
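
Both files can be inspected with the standard library alone. The sketch below counts items, distinct theorems, ground-truth labels, and families in a JSONL stream. It assumes only the field names listed above; the sample records are invented for illustration, and `summarize_jsonl` is a hypothetical helper, not part of the `forminv` package.

```python
import io
import json
from collections import Counter


def summarize_jsonl(lines):
    """Count items, distinct theorems, ground-truth labels, and families."""
    items = [json.loads(line) for line in lines if line.strip()]
    return {
        "n_items": len(items),
        "n_theorems": len({it["theorem_id"] for it in items}),
        # str() so the count works whether labels are strings or booleans.
        "ground_truth": dict(Counter(str(it["ground_truth"]) for it in items)),
        "families": dict(Counter(it["family"] for it in items)),
    }


# Hypothetical records: field names follow the list above, values are invented.
sample = io.StringIO(
    '{"id": "x-1", "theorem_id": "t1", "family": "F1", "ground_truth": "TRUE"}\n'
    '{"id": "x-2", "theorem_id": "t1", "family": "F2", "ground_truth": "TRUE"}\n'
    '{"id": "x-3", "theorem_id": "t2", "family": "F1", "ground_truth": "TRUE"}\n'
)
stats = summarize_jsonl(sample)
print(stats)

# On the real file from this package:
# with open("data/forminv_v3_50.jsonl", encoding="utf-8") as f:
#     print(summarize_jsonl(f))
```

If the descriptions above hold, running this on `forminv_v3_50.jsonl` should report 366 items over 50 theorems, and on `false_controls_pilot.jsonl` 25 items over 5 theorems.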



## Reproducibility

The full evaluation pipeline is available as a pip package:

```bash
pip install forminv
forminv eval --dataset data/forminv_v3_50.jsonl --models anthropic:claude-sonnet-4-6
```

All model responses are cached under the SHA-256 hash of the question text, so re-running the evaluation at temperature 0 returns identical results.
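
The caching idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: it assumes the key is the hex SHA-256 digest of the UTF-8 encoded question text with no further normalization, and it uses a plain in-memory dictionary. The names `cache_key`, `cached_call`, and `fake_model` are hypothetical.

```python
import hashlib


def cache_key(question_text: str) -> str:
    """Return the hex SHA-256 digest of the UTF-8 encoded question text."""
    return hashlib.sha256(question_text.encode("utf-8")).hexdigest()


# Hypothetical in-memory cache; the package's real storage layer may differ.
cache = {}


def cached_call(question_text, call_model):
    """Return the stored response if present; otherwise call the model once and store it."""
    key = cache_key(question_text)
    if key not in cache:
        cache[key] = call_model(question_text)
    return cache[key]


calls = []


def fake_model(question):
    """Stand-in for a real API call; records each invocation."""
    calls.append(question)
    return f"answer to: {question}"


first = cached_call("Does 0 divide 1?", fake_model)
second = cached_call("Does 0 divide 1?", fake_model)
print(first == second, len(calls))
```

The second call is served from the cache, so the model is invoked only once and the repeated run returns the same response.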

Note: API keys for the respective model providers are required to make new model calls. Cached responses from the evaluation are included in the dataset and can be inspected without API access.
