# Controlled mathlib4 subset extraction

This project should not trace all of mathlib in one run. Full mathlib tracing is
too large for routine CPU experiments and can easily duplicate many gigabytes of
build artifacts. The controlled workflow below traces selected mathlib source
files as a small local Lean project.

## Design

1. Choose a small list of Mathlib modules.
2. Copy those source files into `.mathlib_subset_workspace/MathlibSubset/...`.
3. Normalize Lean 4.28 visibility wrappers (`module`, `public import`,
   `@[expose] public section`) in the copied files. This keeps theorem/proof
   bodies intact but avoids LeanDojo 4.20 AST-parser failures on newer syntax.
4. Keep selected-module imports pointing at the original mathlib package by
   default. This makes each copied file an independent trace target and avoids
   duplicate global declarations when a copied dependency has already been
   imported transitively from mathlib. The old behavior, which rewrote imports
   between selected modules to point at the copies, can still be enabled with
   `--rewrite-selected-imports` for tiny debugging runs.
5. Depend on the existing local mathlib package by absolute path, so the subset
   workspace reuses `.lake/packages/mathlib` instead of cloning mathlib again.
6. Build copied modules as independent Lake default targets rather than
   importing all of them through one root module. This avoids duplicate
   declarations when a copied low-level module is also imported by another
   module's original mathlib dependencies.
7. Trace only the subset workspace with LeanDojo and `build_deps=False`.

This traces proof steps in the copied mathlib modules while using mathlib cache
artifacts for dependencies. It gives us staged datasets without asking LeanDojo
to walk every mathlib source file.
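As a rough illustration of the normalization in step 3, the sketch below rewrites the three wrappers line by line and leaves every other line untouched. The regular expressions and the rewrite targets (dropping `module`, turning `public import` into `import`, turning `@[expose] public section` into `section`) are assumptions made for this example; `scripts/prepare_mathlib_subset.py` may handle more forms or rewrite them differently.

```python
import re

# Assumed patterns for the Lean 4.28 visibility wrappers named in step 3.
_MODULE_LINE = re.compile(r"^module\s*$")
_PUBLIC_IMPORT = re.compile(r"^public\s+import\s+")
_EXPOSED_SECTION = re.compile(r"^@\[expose\]\s+public\s+section\b")


def normalize_visibility(source: str) -> str:
    """Rewrite visibility wrappers; theorem and proof bodies pass through unchanged."""
    out = []
    for line in source.splitlines():
        if _MODULE_LINE.match(line):
            continue  # drop the standalone `module` header line
        line = _PUBLIC_IMPORT.sub("import ", line)    # `public import X` -> `import X`
        line = _EXPOSED_SECTION.sub("section", line)  # keep any section name that follows
        out.append(line)
    return "\n".join(out) + "\n"
```

Because the patterns are anchored at the start of a line, indented proof text that happens to contain the word `module` or `section` is not affected.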

The dataset builder also applies three LeanDojo compatibility patches at runtime:

- Lean 4.28 extractor types in `ExtractData.lean`;
- Lean 4.28 `sectionHeader` AST nodes around `section`;
- external absolute import paths produced by local mathlib dependencies.

These patches keep the trace focused on the copied subset files and avoid
materializing dependency modules as traced files.

## Smoke Run

```bash
python scripts/prepare_mathlib_subset.py \
  --modules-file configs/mathlib_subset_smoke.txt \
  --workspace .mathlib_subset_workspace \
  --force

cd .mathlib_subset_workspace
lake build
cd ..

python scripts/build_dataset.py \
  --source leandojo \
  --repo .mathlib_subset_workspace \
  --local-clone-dir .leandojo_mathlib_subset_source \
  --max-steps 1000 \
  --output data/mathlib_subset_smoke_steps.jsonl \
  --checked-output data/mathlib_subset_smoke_steps_checked.jsonl
```

The smoke run (stage S0 in the Growth Schedule) was verified on 2026-05-21 with
Lean 4.28.0 and mathlib commit
`8f9d9cff6bd728b17a24e163c9402775d9e6a365`:

| Item | Value |
| --- | ---: |
| selected modules | 3 |
| copied source lines | 380 |
| Lake build size | 692 jobs |
| extracted proof steps | 85 |
| theorem names | 62 |
| tactic families | 14 |
| subset workspace disk | 383 MB |
| LeanDojo cache for this trace | 2.8 GB |

The checked smoke dataset is
`data/mathlib_subset_smoke_steps_checked.jsonl`. Its label distribution is
headed by `simp` (23), `decide` (19), `grind` (16), `cases` (7), and `rw` (5).
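A distribution like this can be tallied directly from a checked JSONL file. The sketch below assumes each row stores its tactic family under a `family` key; that field name is hypothetical and should be replaced with whatever the dataset schema actually uses.

```python
import json
from collections import Counter
from pathlib import Path


def family_counts(jsonl_path, field="family"):
    """Count rows per tactic family; `field` is an assumed key name."""
    counts = Counter()
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # tolerate blank lines
                counts[json.loads(line)[field]] += 1
    return counts
```

Calling `family_counts(path).most_common(5)` then yields the head of the distribution in descending order.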

Stage S1 was verified on 2026-05-21 using
`configs/mathlib_subset_arithmetic.txt` with `--max-modules 8` and a 5,000-step
cap:

| Item | Value |
| --- | ---: |
| selected modules | 8 |
| copied source lines | 702 |
| Lake build size | 797 jobs |
| workspace preparation time | 2.71 s |
| standalone Lake build time | 58.83 s |
| LeanDojo extraction time | 253.06 s |
| extracted proof steps | 163 |
| theorem names | 51 |
| tactic families | 29 |
| subset workspace disk | 386 MB |
| LeanDojo cache for this trace | 2.8 GB |

The checked S1 dataset is
`data/mathlib_subset_s1_arithmetic8_steps_checked.jsonl`. The 5,000-step cap was
not reached because these eight modules contain only 163 traced proof steps.
The most common families were `rw` (25), `exact` (22), `simp` (21), `have`
(10), `simpa` (10), and `obtain` (9). `Mathlib.Data.Nat.Prime.Basic` dominated
the subset with 92 of the 163 steps.

S2 was verified on 2026-05-21 with the full
`configs/mathlib_subset_arithmetic.txt` profile and a 10,000-step cap:

| Item | Value |
| --- | ---: |
| selected modules | 19 |
| copied source lines | 1,250 |
| selected-import rewrite | disabled |
| Lake build size | 1,049 jobs |
| workspace preparation time | 2.78 s |
| standalone Lake build time | 57.65 s |
| LeanDojo extraction time | 381.13 s |
| extracted proof steps | 268 |
| theorem names | 81 |
| files with traced steps | 17 |
| tactic families | 37 |
| subset workspace disk | 391 MB |
| LeanDojo cache for this trace | 2.8 GB |

The checked S2 dataset is
`data/mathlib_subset_s2_arithmetic_full_steps_checked.jsonl`. The 10,000-step
cap was not reached; this arithmetic profile still contains only 268 traced
proof steps. The most common families were `rw` (50), `simp` (39), `exact`
(33), `obtain` (15), `have` (14), `simpa` (14), `refine` (13), and `apply`
(12). File-level contribution was again concentrated: `Mathlib.Data.Nat.Prime.Basic`
contributed 92 steps, followed by `Mathlib.Algebra.BigOperators.Ring.Nat` and
`Mathlib.Data.Nat.PSub` with 28 steps each.

An attempted S2 build with selected-import rewriting enabled failed because
`MathlibSubset.Mathlib.Data.Rat.Sqrt` imported the copied
`MathlibSubset.Mathlib.Data.Int.Sqrt` after original `Mathlib.Data.Int.Sqrt`
had already entered the environment transitively, duplicating `Int.sqrt`. This
is why selected-import rewriting is off by default.
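The failure mode can be checked ahead of time on an import graph. The sketch below is not part of the project scripts; it assumes a hypothetical `imports` mapping from module name to its direct imports and reports, for each selected module, which copies would be loaded alongside their originals if imports between selected modules were rewritten.

```python
def reachable(starts, imports):
    """Return every module reachable from `starts`, including the starts themselves."""
    seen = set()
    stack = list(starts)
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(imports.get(current, ()))
    return seen


def rewrite_conflicts(selected, imports):
    """Map each selected module to the copies that would clash with their originals."""
    # Edges between selected modules are the ones rewriting redirects to copies.
    copied_edges = {m: imports.get(m, set()) & selected for m in selected}
    conflicts = {}
    for module in sorted(selected):
        copies = reachable([module], copied_edges)
        # Imports of unselected modules still point at original mathlib.
        original_entry = set()
        for copied in copies:
            original_entry |= imports.get(copied, set()) - selected
        originals = reachable(original_entry, imports)
        clash = copies & originals  # loaded both as a copy and as an original
        if clash:
            conflicts[module] = clash
    return conflicts
```

For a toy graph where `Sel.High` imports `Sel.Low` and `Other.Mid`, and `Other.Mid` also imports `Sel.Low`, selecting both `Sel.High` and `Sel.Low` reports `Sel.Low` as a clash for `Sel.High`, which mirrors the `Int.sqrt` duplication described above.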

S3 was verified on 2026-05-21 with the arithmetic and mixed-small profiles plus
an additional proof-dense module list,
`configs/mathlib_subset_high_density.txt`, using a 25,000-step cap:

| Item | Value |
| --- | ---: |
| selected modules | 46 |
| copied source lines | 9,674 |
| selected-import rewrite | disabled |
| Lake build size | 1,544 jobs |
| workspace preparation time | 3.12 s |
| standalone Lake build time | 81.70 s |
| LeanDojo extraction time | 733.66 s |
| extracted proof steps | 2,199 |
| theorem names | 1,022 |
| files with traced steps | 42 |
| tactic families | 72 |
| subset workspace disk | 416 MB |
| LeanDojo cache for this trace | 2.9 GB |

The checked S3 dataset is
`data/mathlib_subset_s3_arith_mixed_dense_steps_checked.jsonl`. The 25,000-step
cap was not reached, but proof-dense module selection grew the dataset to about
8.2x the size of S2 (2,199 versus 268 steps). The most common families were
`rw` (487), `simp` (423),
`exact` (207), `grind` (182), `simpa` (84), `aesop` (67), `refine` (64), and
`have` (61). The largest file contributors were
`Mathlib.Data.Nat.Factorization.Basic` (352), `Mathlib.Data.List.Rotate` (266),
`Mathlib.Data.Nat.ModEq` (159), `Mathlib.Algebra.Ring.Parity` (152), and
`Mathlib.Order.BooleanAlgebra.Basic` (143).

S3 classification and retrieval proxy tables are:

- `results/tables/mathlib_subset_s3_arith_mixed_dense_classification_summary.csv`
- `results/tables/mathlib_subset_s3_arith_mixed_dense_search_summary.csv`

On this split, text/premise Naive Bayes substantially beat majority-class
accuracy. A first hard-priority family-guided retrieval proxy underperformed
unguided retrieval. The implemented soft variant ranks candidates by
`similarity + weight * predicted_family_probability`; with a Naive Bayes family
model and the premise-aware representation, it improved family@1 from 0.365
to 0.437 and exact@1 from 0.174 to 0.180, while larger weights hurt higher-k
retrieval. This supports treating family prediction as a soft ranking prior
rather than a hard filter.
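The soft ranking rule is small enough to sketch. The `Candidate` type, the function name, and the numbers in the usage note are illustrative assumptions, not the interface of `scripts/run_search.py`; only the scoring expression comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    tactic: str
    family: str
    similarity: float


def soft_rerank(candidates, family_probs, weight):
    """Rank by similarity + weight * predicted_family_probability."""
    def score(candidate):
        # Families the model did not score contribute no bonus.
        return candidate.similarity + weight * family_probs.get(candidate.family, 0.0)

    return sorted(candidates, key=score, reverse=True)
```

With `weight=0` this reduces to unguided similarity ranking. With made-up values, a `simp` candidate at similarity 0.78 and family probability 0.6 overtakes an `rw` candidate at similarity 0.80 and probability 0.3 once `weight=0.2`, since 0.78 + 0.12 exceeds 0.80 + 0.06. The family model only nudges the order, so a strong similarity match from a less likely family can still rank first.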

Additional S3 search tables:

- `results/tables/mathlib_subset_s3_arith_mixed_dense_search_soft_sweep.csv`
- `results/tables/mathlib_subset_s3_arith_mixed_dense_search_soft_nb_sweep.csv`
- `results/tables/mathlib_subset_s3_arith_mixed_dense_search_soft_nb_premise_sweep.csv`
- `results/tables/mathlib_subset_s3_arith_mixed_dense_search_paper_summary.csv`

## Growth Schedule

Use these stages, stopping whenever the time or disk budget is exceeded.

| Stage | Module source | Step cap | Purpose |
| --- | --- | ---: | --- |
| S0 | `configs/mathlib_subset_smoke.txt` | 1,000 | Verify build, trace and schema. |
| S1 | `configs/mathlib_subset_arithmetic.txt --max-modules 8` | 5,000 | Arithmetic-only pilot. |
| S2 | `configs/mathlib_subset_arithmetic.txt` | 10,000 | First paper-scale subset. |
| S3 | arithmetic + `configs/mathlib_subset_mixed_small.txt` + `configs/mathlib_subset_high_density.txt` | 25,000 | More tactic-family diversity and proof density. |
| S4 | curated extra modules after error audit | 50,000 | Submission-scale run. |

For each stage, record:

- selected modules and copied source lines from `.mathlib_subset_workspace/subset_manifest.json`;
- output row count from the JSONL file;
- `du -sh .mathlib_subset_workspace ~/.cache/lean_dojo`;
- elapsed time for `lake build` and `scripts/build_dataset.py`;
- any failed module and the Lean/Lake error message.
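Two of these measurements, the output row count and the disk footprint, can be collected with a few lines of standard-library Python. This is a minimal sketch with hypothetical function names. It sums apparent file sizes, so its totals can differ somewhat from what `du -sh` reports in allocated blocks.

```python
import os
from pathlib import Path


def count_jsonl_rows(path):
    """Count non-empty lines, i.e. dataset rows, in a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def dir_size_bytes(root):
    """Sum apparent file sizes under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def stage_record(stage, jsonl_path, measured_dirs):
    """Bundle the per-stage numbers into one dictionary for logging."""
    return {
        "stage": stage,
        "rows": count_jsonl_rows(jsonl_path),
        "disk_bytes": {str(d): dir_size_bytes(d) for d in measured_dirs},
    }
```

A call such as `stage_record("S0", "data/mathlib_subset_smoke_steps_checked.jsonl", [".mathlib_subset_workspace"])` gives one record per stage that can be appended to a log file.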

## Guardrails

- Keep `--dependency-mode local` unless a fully portable reproduction package is
  being prepared.
- Do not pass `--build-deps` to `scripts/build_dataset.py` for subset runs.
- Keep `.mathlib_subset_workspace` and `.leandojo_mathlib_subset_source` ignored.
- Add modules in small batches and run `lake build` before LeanDojo tracing.
- Prefer modules below about 1,500 source lines for iterative experiments.
- Use `--max-total-lines` and `--max-modules` when trying a new profile.
- Inspect cache size after every run with:
  `du -sh .mathlib_subset_workspace ~/.cache/lean_dojo`.
- If disk pressure matters more than rerun speed, remove old subset trace caches
  under `~/.cache/lean_dojo/gitpython-.leandojo_mathlib_subset_source-*`.
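Before removing anything, it can help to list which subset trace caches are candidates. The sketch below only lists directories matching the pattern from the last guardrail and leaves deletion to the operator. The function name and the `keep_latest` policy are assumptions for illustration.

```python
from pathlib import Path

CACHE_PATTERN = "gitpython-.leandojo_mathlib_subset_source-*"


def stale_subset_caches(cache_root, keep_latest=1):
    """Return matching cache directories, oldest first, excluding the newest `keep_latest`."""
    caches = sorted(
        (p for p in Path(cache_root).glob(CACHE_PATTERN) if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
    )
    return caches[:-keep_latest] if keep_latest > 0 else caches
```

Calling `stale_subset_caches(Path.home() / ".cache" / "lean_dojo")` returns the older trace caches while leaving the most recent one in place for fast reruns.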

## Commands For Paper Tables

After a subset JSONL is generated:

```bash
python scripts/analyze_dataset.py \
  --data data/mathlib_subset_smoke_steps_checked.jsonl \
  --output-json results/dataset/mathlib_subset_smoke_summary.json \
  --summary-csv results/tables/mathlib_subset_smoke_dataset_summary.csv \
  --label-csv results/tables/mathlib_subset_smoke_label_distribution.csv

python scripts/run_experiments.py \
  --data data/mathlib_subset_smoke_steps_checked.jsonl \
  --representations all \
  --table-output results/tables/mathlib_subset_smoke_classification_summary.csv

python scripts/run_multiseed.py \
  --data data/mathlib_subset_smoke_steps_checked.jsonl \
  --representations all

python scripts/run_search.py \
  --data data/mathlib_subset_smoke_steps_checked.jsonl \
  --strategy all \
  --table-output results/tables/mathlib_subset_smoke_search_summary.csv

python scripts/analyze_errors.py \
  --data data/mathlib_subset_smoke_steps_checked.jsonl \
  --representation structured
python scripts/make_paper_tables.py
```
