# Supplementary Evidence Appendix

This appendix provides extra plots and source-structure checks that support the
main paper's workflow claims. These checks are not needed to reproduce the core
artifact result; the core result remains the Lean build plus the 107-trace axiom
audit.

## Additional Figures

The `figures/` directory contains vector PDFs:

| File | Purpose | Claim boundary |
|---|---|---|
| `cumulative_pr_curve.pdf` | Ledger-derived opened/merged/closed PR trajectory. | Feasibility and queue pressure, not causal speedup. |
| `axiom_funnel.pdf` | PRs opened -> PRs merged -> showcase traces -> canonical traces -> custom axioms found. | Shows audit outcome, not statement adequacy. |
| `loc_by_chain_bar.pdf` | Source lines by proof chain. | Source mass by area, not theorem difficulty. |
| `proof_length_strip.pdf` | Headline theorem proof-block source lengths. | Source-span proxy only, not mathematical complexity. |
| `time_to_merge_histogram.pdf` | Merged-PR lifetime distribution. | PR lifetime, not human review time. |
| `metrics_strip.pdf` | Poster-ready count strip. | Visual restatement of audited counts. |
| `local_reference_chain_matrix.pdf` | Static theorem/lemma local-reference edges by source and referenced chain. | Source-level compositional proxy, not Lean kernel dependency extraction. |
| `local_reference_depth_by_chain.pdf` | Static local-reference-depth distribution by proof chain. | Textual local-reference depth, not theorem difficulty. |

## Static Source-Structure Proxy

The source-structure script gives a first-pass answer to one question: does the
artifact consist mostly of isolated wrapper lemmas, or does it contain reusable
internal structure?

Run from the supplement root:

```bash
python3 scripts/analyze_source_structure.py \
  --tarball formalslt_anonymized.tar.gz \
  --json-out SOURCE_STRUCTURE_ANALYSIS.json
```

Expected summary for this snapshot:

```text
Files: 46
Source lines: 20080
Theorem/lemma declarations: 412
Mathlib import lines: 174
Internal FormalSLT import lines: 112
Declarations with local refs: 257 (62.4%)
```

Methodology: declaration spans run from a line-leading `theorem` or `lemma` to
the next top-level declaration. The script counts a theorem/lemma span as using
local structure when it textually references another local FormalSLT theorem or
lemma name, excluding short/common names such as `symm` and `mono`.

This is a conservative source-structure proxy. It does not count every Mathlib
lemma call, and it does not prove theorem novelty or proof difficulty.
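
The span-and-reference rule described in the methodology can be sketched as
follows. This is a minimal illustration, not the code in
`scripts/analyze_source_structure.py`: the set of top-level keywords, the
length threshold for "short" names, the token pattern, and the sample source
are all assumptions made for the example.

```python
import re

# Assumed set of keywords that end a theorem/lemma span; the script's
# actual list may differ.
TOP_LEVEL = ("theorem", "lemma", "def", "instance", "structure", "abbrev")
DECL_RE = re.compile(r"^(" + "|".join(TOP_LEVEL) + r")\s+([^\s:({\[]+)", re.MULTILINE)

# Rough identifier pattern: a letter or underscore, then word characters,
# dots, or primes.
TOKEN_RE = re.compile(r"[^\W\d][\w.']*")

COMMON_NAMES = {"symm", "mono"}  # examples named in the methodology
MIN_NAME_LEN = 5                 # hypothetical cutoff for "short" names


def declaration_spans(source):
    """Return (name, span_text) for each line-leading theorem or lemma.

    A span runs from the declaration keyword to the next top-level
    declaration of any kind, or to the end of the source.
    """
    matches = list(DECL_RE.finditer(source))
    spans = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(source)
        if m.group(1) in ("theorem", "lemma"):
            spans.append((m.group(2), source[m.start():end]))
    return spans


def local_reference_share(source):
    """Count theorem/lemma spans that mention another local theorem/lemma name."""
    spans = declaration_spans(source)
    candidates = {
        name for name, _ in spans
        if len(name) >= MIN_NAME_LEN and name not in COMMON_NAMES
    }
    with_refs = 0
    for name, text in spans:
        tokens = set(TOKEN_RE.findall(text))
        # A declaration's own name appears in its span, so drop it.
        if (tokens & candidates) - {name}:
            with_refs += 1
    return with_refs, len(spans)


# Made-up Lean-like source, used only to exercise the functions.
SAMPLE = """\
theorem base_bound (n : Nat) : n ≤ n + 1 := by
  omega

def helper (n : Nat) : Nat := n + 1

lemma uses_base (n : Nat) : n ≤ n + 1 := by
  exact base_bound n

theorem standalone (n : Nat) : n = n := rfl
"""

with_refs, total = local_reference_share(SAMPLE)
print(f"Declarations with local refs: {with_refs} of {total}")
```

On the sample, only `uses_base` mentions another local theorem name, so the
sketch reports 1 of 3. Because matching is purely textual, a name that appears
in a statement counts the same as one that appears in a proof body.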

## Static Declaration Local-Reference Graph

The local-reference graph script gives a stricter version of the local-reuse
check. Nodes are line-leading theorem/lemma declarations, found after skipping
comments. An edge `A -> B` is added only when declaration `A` textually
references the name of `B` and that name belongs to exactly one local theorem or
lemma. Duplicate names, short names, and common names are ignored.

Run from the supplement root:

```bash
python3 scripts/analyze_local_reference_graph.py \
  --tarball formalslt_anonymized.tar.gz \
  --json-out LOCAL_REFERENCE_GRAPH_ANALYSIS.json \
  --markdown-out LOCAL_REFERENCE_GRAPH_ANALYSIS.md \
  --matrix-csv LOCAL_REFERENCE_CHAIN_MATRIX.csv \
  --fig-dir figures
```

Expected summary for this snapshot:

```text
Declarations: 412
Static local-reference edges: 463
Declarations with textual local-reference edges: 257
Declarations referenced by others: 295
Largest weak component: 177
Max static local-reference depth: 18
```

This is evidence that the artifact has source-level compositional structure. It
is not Lean kernel dependency extraction and does not measure proof difficulty,
mathematical novelty, or agent productivity. The scan can miss references
through duplicate names, notation, generated names, local aliases, definitions,
instances, or Mathlib declarations. It can also count names appearing in theorem
statements rather than proof bodies. Raw chain-to-chain edge counts are not
normalized by chain size.
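
The summary quantities reported for the graph can be computed from a plain edge
list along the following lines. This is a hedged sketch rather than the logic
of `scripts/analyze_local_reference_graph.py`: it assumes that depth means the
number of edges on the longest reference chain starting at a declaration, that
the edge set is acyclic, and that a weak component is a connected component
once edge direction is ignored. The node names and edges are invented.

```python
from collections import defaultdict


def largest_weak_component(nodes, edges):
    """Size of the largest component when edge direction is ignored."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    sizes = defaultdict(int)
    for n in nodes:
        sizes[find(n)] += 1
    return max(sizes.values(), default=0)


def max_reference_depth(nodes, edges):
    """Longest chain of references, counted in edges (assumed definition).

    Assumes the edge set is acyclic. On a cycle the recursion still
    terminates, but the returned value is then not meaningful.
    """
    out = defaultdict(list)
    for a, b in edges:
        out[a].append(b)
    depth = {}

    def visit(n):
        if n not in depth:
            depth[n] = 0  # placeholder while the node is being explored
            depth[n] = max((1 + visit(m) for m in out[n]), default=0)
        return depth[n]

    return max((visit(n) for n in nodes), default=0)


# Invented example: a -> b -> c, d -> c, and an isolated declaration e.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("d", "c")]

summary = {
    "declarations": len(nodes),
    "edges": len(edges),
    "with_outgoing_edges": len({a for a, _ in edges}),
    "referenced_by_others": len({b for _, b in edges}),
    "largest_weak_component": largest_weak_component(nodes, edges),
    "max_depth": max_reference_depth(nodes, edges),
}
print(summary)
```

In the invented example, `a`, `b`, `c`, and `d` form one weak component of size
4, and the longest chain `a -> b -> c` has depth 2 under the assumed
definition.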

## Chain-Level Decomposition

| Chain | Files | LoC | Theorem/lemma decls | Mathlib imports | FormalSLT imports | Decls with local refs |
|---|---:|---:|---:|---:|---:|---:|
| Covering | 4 | 6,460 | 134 | 13 | 6 | 73.1% |
| Rademacher | 13 | 4,050 | 72 | 78 | 23 | 55.6% |
| Azuma | 9 | 2,392 | 58 | 16 | 14 | 44.8% |
| Stability | 2 | 2,329 | 54 | 7 | 4 | 72.2% |
| PAC-Bayes | 4 | 2,022 | 34 | 14 | 5 | 79.4% |
| Risk/ERM spine | 5 | 1,193 | 21 | 15 | 50 | 66.7% |
| PAC-VC | 6 | 1,107 | 23 | 13 | 10 | 56.5% |
| Probability | 3 | 527 | 16 | 18 | 0 | 0.0% |
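
As a consistency check, the chain rows can be summed and compared with the
snapshot totals reported by the source-structure script. The sketch below
copies the numbers from the table; the per-chain counts of declarations with
local references are recovered from the rounded percentages, which is an
inference made here and not script output.

```python
# (chain, files, loc, decls, mathlib_imports, formalslt_imports, pct_with_local_refs)
rows = [
    ("Covering",       4,  6460, 134, 13,  6, 73.1),
    ("Rademacher",     13, 4050,  72, 78, 23, 55.6),
    ("Azuma",          9,  2392,  58, 16, 14, 44.8),
    ("Stability",      2,  2329,  54,  7,  4, 72.2),
    ("PAC-Bayes",      4,  2022,  34, 14,  5, 79.4),
    ("Risk/ERM spine", 5,  1193,  21, 15, 50, 66.7),
    ("PAC-VC",         6,  1107,  23, 13, 10, 56.5),
    ("Probability",    3,   527,  16, 18,  0,  0.0),
]

# Column sums for files, LoC, declarations, Mathlib imports, FormalSLT imports.
files, loc, decls, mathlib, internal = (
    sum(row[i] for row in rows) for i in range(1, 6)
)

# Recover integer counts from rounded percentages (inferred, not reported).
with_refs = sum(round(row[3] * row[6] / 100) for row in rows)
share = round(100 * with_refs / decls, 1)

print(f"Files: {files}")
print(f"Source lines: {loc}")
print(f"Theorem/lemma declarations: {decls}")
print(f"Mathlib import lines: {mathlib}")
print(f"Internal FormalSLT import lines: {internal}")
print(f"Declarations with local refs: {with_refs} ({share}%)")
```

The column sums match the expected summary (46 files, 20080 source lines, 412
declarations, 174 Mathlib import lines, 112 internal import lines), and the
recovered per-chain counts add up to 257, or 62.4% of declarations.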

## What the Current Ledger Cannot Answer

The anonymized workflow ledger records PR target family, opened/closed times,
merge decision, agent tag, and anonymized final commit hash. It does not contain
CI start/end timestamps, human-review start/end timestamps, local failed
attempts before PR creation, or line-level human/agent authorship. The
submission therefore makes no claims about:

- Claude-vs-Codex performance;
- agent-vs-human line attribution;
- CI time versus human review time;
- gate-specific counts for the 7 closed-without-merge PRs;
- a single-agent baseline speedup.

These are appropriate follow-up measurements, but they are outside the current
snapshot.
