# Claude Code Workflow Evaluation

Run directory: `outputs/2026-05-08_pdf10_mdnav`

This is an operational trace, not a human-subjects study. Both conditions ran Claude Code in non-interactive mode with the same model on the same fixed task.

## Summary Table

| Condition | Wall time | Claude turns | Tool calls | Tool text | Output tokens | Estimated blocking model calls | Cost | Final question (truncated) |
|---|---:|---:|---:|---:|---:|---:|---:|---|
| `lacuna_navigation` | 138.4s | 8 | 7 | 41.5KB | 8283 | 1 | $0.217 | Under what conditions does iterative compiler feedback enable LLMs to close the autoformalization gap, and at what di... |
| `pdf_to_chat_baseline` | 192.6s | 9 | 8 | 91.1KB | 10691 | 6 | $0.316 | To what extent does reinforcement learning from proof-assistant verifier feedback (Lean REPL interactions) generalize... |
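
The relative differences between the two rows can be worked out directly from the table. The following is a minimal sketch with the values copied from the rows above; the dictionary keys and the `pct_change` helper are illustrative names and are not part of the evaluation harness.

```python
# Values copied from the Summary Table; the key names are illustrative.
runs = {
    "lacuna_navigation": {
        "wall_s": 138.4, "tool_kb": 41.5, "output_tokens": 8283, "cost_usd": 0.217,
    },
    "pdf_to_chat_baseline": {
        "wall_s": 192.6, "tool_kb": 91.1, "output_tokens": 10691, "cost_usd": 0.316,
    },
}


def pct_change(new: float, base: float) -> float:
    """Percentage change of `new` relative to `base` (negative means lower)."""
    return (new - base) / base * 100.0


base = runs["pdf_to_chat_baseline"]
nav = runs["lacuna_navigation"]

# One rounded percentage per metric, navigation run relative to the baseline run.
deltas = {metric: round(pct_change(nav[metric], base[metric]), 1) for metric in base}

for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.1f}% vs. baseline")
```

With these inputs the navigation run comes out about 28.1% lower in wall time, 54.4% lower in tool text, 22.5% lower in output tokens, and 31.3% lower in cost. Each figure rests on a single run per condition.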

## Tool Commands

### lacuna_navigation

- `curl -sS "http://34.8.208.118/md/render/direction/the-autoformalization-gap-in-theorem-proving-23293" | head -c 30000`
- `curl -sS "http://34.8.208.118/md/render/direction/iterative-compiler-feedback-for-formal-theorem-proving-12169" | head -c 30000`
- `curl -sS "http://34.8.208.118/md/render/paper/proofnet-autoformalizing-and-formally-proving-undergraduate-level-mathematics/art_f5a5f3551f6641598e578328d5771b3b" | head -c 30000`
- `curl -sS "http://34.8.208.118/md/render/paper/fimo-a-challenge-formal-dataset-for-automated-theorem-proving/art_d65417c966cc4525ab8bd63248394361" | head -c 30000`
- `curl -sS "http://34.8.208.118/md/render/paper/stepfun-prover-preview-let-s-think-and-verify-step-by-step/art_804f191b837940a1a2b73568d30d29b0" | head -c 30000`
- `curl -sS "http://34.8.208.118/md/render/paper/verina-benchmarking-verifiable-code-generation/art_ae1f0fb500a44279ba999f4c6df4ed6e" | head -c 30000`
- `curl -sS "http://34.8.208.118/md/render/hypothesis/token-level-alignment-of-informal-mathematics-to-formal-compiler-states-5f6a6e04c0bdbac4" | head -c 30000`
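
The seven commands above share one pattern: fetch a page and keep at most the first 30000 bytes. A rough Python equivalent is sketched below, assuming the standard-library `urllib` client stands in for `curl`. The names `fetch_bounded` and `fetch_all` are hypothetical, and `urlopen` follows redirects, which `curl` without `-L` does not.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable
from urllib.request import urlopen

BYTE_LIMIT = 30_000  # mirrors `head -c 30000` in the commands above


def fetch_bounded(url: str, limit: int = BYTE_LIMIT) -> bytes:
    """Read at most `limit` bytes from `url` (approximates `curl ... | head -c`)."""
    with urlopen(url, timeout=30) as response:
        return response.read(limit)


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_bounded,
    workers: int = 7,
) -> Dict[str, bytes]:
    """Fetch every URL concurrently and return a mapping of URL to bounded body."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its body.
        bodies = pool.map(fetch, url_list)
        return dict(zip(url_list, bodies))
```

Passing the fetcher as a parameter keeps the concurrency logic testable without network access. Note that the bound is in bytes, so a multi-byte character at the cut-off may be split.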

### pdf_to_chat_baseline

- `curl -L -sS -o /tmp/proofnet.pdf https://arxiv.org/pdf/2302.12433 && pdftotext -f 1 -l 10 -layout /tmp/proofnet.pdf - | head -c 50000`
- `cat <claude_tool_result_path>`
- `pdftotext -f 1 -l 10 -layout /tmp/proofnet.pdf - | head -c 50000 | strings | head -200`
- `curl -L -sS -o /tmp/fimo.pdf https://arxiv.org/pdf/2309.04295 && pdftotext -f 1 -l 10 -layout /tmp/fimo.pdf - | head -c 50000 | strings | head -300`
- `pdftotext -f 1 -l 10 -layout /tmp/fimo.pdf - | head -c 50000 | strings | sed -n '50,300p'`
- `curl -L -sS "https://stepfun.ai/research/en/stepfun-prover-preview" | head -c 50000`
- `curl -L -sS "https://stepfun.ai/research/en/stepfun-prover-preview" | python3 -c "
import sys, re
html = sys.stdin.read()
# Remove scripts, styles, head
html = re.sub(r'<script[^>]*>.*?</script>', '', html, flags=re.DOTALL)
html = re.sub(r'<style[^>]*>.*?</style>', '', html, flags=re.DOTALL)
html = re.sub(r'<head[^>]*>.*?</head>', '', html, flags=re.DOTALL)
# Remove tags
text = re.sub(r'<[^>]+>', ' ', html)
# Clean whitespace
text = re.sub(r'[ \t]+', ' ', text)
text = re.sub(r'\n\s*\n+', '\n\n', text)
print(text[:20000])
"`
- `curl -L -sS -o /tmp/verina.pdf https://arxiv.org/pdf/2505.23135 && pdftotext -f 1 -l 10 -layout /tmp/verina.pdf - | head -c 50000 | strings | head -300`
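
The inline `python3 -c` script in the StepFun command reduces an HTML page to plain text with a handful of regular expressions. The same steps are sketched below as a reusable function. The name `html_to_text` and the `max_chars` parameter are illustrative, and the regex approach is kept only to mirror the logged command; it is not a substitute for a real HTML parser.

```python
import re


def html_to_text(html: str, max_chars: int = 20000) -> str:
    """Crude regex-based HTML-to-text reduction, mirroring the logged inline script."""
    # Drop script, style, and head blocks together with their contents.
    for tag in ("script", "style", "head"):
        html = re.sub(rf"<{tag}[^>]*>.*?</{tag}>", "", html, flags=re.DOTALL)
    # Replace every remaining tag with a single space.
    text = re.sub(r"<[^>]+>", " ", html)
    # Collapse runs of spaces and tabs, then runs of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    # Keep only the first `max_chars` characters, as the logged script does.
    return text[:max_chars]


sample = (
    "<html><head><title>T</title></head><body>"
    "<script>var x = 1;</script><p>Hello   world</p></body></html>"
)
print(html_to_text(sample).strip())
```

As in the logged script, the patterns are case-sensitive and the `<head` pattern would also match a `<header>` opening tag, so pages with unusual markup may lose or keep more text than intended.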

## Final Outputs

### lacuna_navigation

```json
{
  "condition": "lacuna_navigation",
  "route": [
    {
      "url": "http://34.8.208.118/md/render/direction/the-autoformalization-gap-in-theorem-proving-23293",
      "reason": "Primary anchor: establishes the conceptual framing of the autoformalization gap \u2014 granularity mismatch, implicit knowledge, neuro-symbolic bridges, and semantic misalignment \u2014 grounding all downstream paper readings."
    },
    {
      "url": "http://34.8.208.118/md/render/direction/iterative-compiler-feedback-for-formal-theorem-proving-12169",
      "reason": "Operational complement to the gap direction: explains how REPL-based iterative feedback transforms proof generation from one-shot to agent-loop, and crucially identifies where feedback fails (semantic wall, diminishing returns)."
    },
    {
      "url": "http://34.8.208.118/md/render/paper/proofnet-autoformalizing-and-formally-proving-undergraduate-level-mathematics/art_f5a5f3551f6641598e578328d5771b3b",
      "reason": "Empirical benchmark grounding: provides concrete accuracy numbers (16.1% best-case with retrieval), identifies implicit-hypothesis failure mode, and introduces typechecking rate as proxy metric."
    },
    {
      "url": "http://34.8.208.118/md/render/paper/fimo-a-challenge-formal-dataset-for-automated-theorem-proving/art_d65417c966cc4525ab8bd63248394361",
      "reason": "High-difficulty baseline: GPT-4 achieves 0% solve rate on IMO-level problems even with ground-truth informal proofs as hints, establishing the 'reasoning wall' between formalization success and proof search capability."
    },
    {
      "url": "http://34.8.208.118/md/render/paper/stepfun-prover-preview-let-s-think-and-verify-step-by-step/art_804f191b837940a1a2b73568d30d29b0",
      "reason": "State-of-the-art positive result: 70.0% pass@1 on miniF2F via GRPO-based tool-integrated RL, showing that embedding verification into the reasoning trajectory outperforms models 20x larger."
    },
    {
      "url": "http://34.8.208.118/md/render/paper/verina-benchmarking-verifiable-code-generation/art_ae1f0fb500a44279ba999f4c6df4ed6e",
      "reason": "Holistic stress-test: joint code+spec+proof generation reveals 3.2% end-to-end success for best model (o3), and 64-round iterative feedback plateaus at ~20% proof success \u2014 quantifying the hard ceiling of feedback-loop approaches."
    },
    {
      "url": "http://34.8.208.118/md/render/hypothesis/token-level-alignment-of-informal-mathematics-to-formal-compiler-states-5f6a6e04c0bdbac4",
      "reason": "Forward-looking mechanistic proposal: CTC/DTW-style token-to-compiler-state alignment as a credit-assignment fix, bridging the gap between coarse feedback signals and the localized corrections needed for deep proof search."
    }
  ],
  "pages_read_count": 7,
  "branch_decisions": [
    "Read both direction pages before individual papers to build shared conceptual vocabulary (granularity mismatch, REPL loop, semantic wall) before encountering specific empirical results.",
    "Ordered FIMO before StepFun-Prover so the 0% IMO baseline precedes the 70% miniF2F success \u2014 prevents the positive SOTA result from obscuring how hard the underlying task remains at higher difficulty.",
    "Placed VERINA after StepFun-Prover to show limits of iterative approaches at harder, multi-component tasks, providing a natural arc from optimism to qualified constraint.",
    "Ended at the token-level alignment hypothesis rather than a second paper, because it synthesizes the credit-assignment problem identified across all prior pages and points to a concrete mechanistic research direction."
  ],
  "source_checks": [
    {
      "claim": "Codex with prompt retrieval achieves 16.1% formalization accuracy on ProofNet, up from 13.4% baseline, with typecheck rate rising from 23.7% to 45.2%.",
      "checked_against": "http://34.8.208.118/md/render/paper/proofnet-autoformalizing-and-formally-proving-undergraduate-level-mathematics/art_f5a5f3551f6641598e578328d5771b3b (Key Results section)",
      "status": "supported"
    },
    {
      "claim": "GPT-4 achieves a 0% pass rate on solving FIMO problems even when provided with the ground-truth informal proof as a hint.",
      "checked_against": "http://34.8.208.118/md/render/paper/fimo-a-challenge-formal-dataset-for-automated-theorem-proving/art_d65417c966cc4525ab8bd63248394361 (Key Insights and Results: 'The Reasoning Wall')",
      "status": "supported"
    },
    {
      "claim": "FIMO formalization success rate rises from 32.6% to 60.8% when iterative compiler feedback is applied.",
      "checked_against": "http://34.8.208.118/md/render/paper/fimo-a-challenge-formal-dataset-for-automated-theorem-proving/art_d65417c966cc4525ab8bd63248394361 (Methodology section)",
      "status": "supported"
    },
    {
      "claim": "StepFun-Prover-Preview-32B achieves 70.0% pass@1 on miniF2F-test, outperforming DeepSeek-Prover-V2-671B (61.9%) and Kimina-Prover-72B (63.9%).",
      "checked_against": "http://34.8.208.118/md/render/paper/stepfun-prover-preview-let-s-think-and-verify-step-by-step/art_804f191b837940a1a2b73568d30d29b0 (Experimental Results section)",
      "status": "supported"
    },
    {
      "claim": "OpenAI o3 achieves 72.6% CodeGen, 52.3% SpecGen, 4.9% ProofGen, and 3.2% end-to-end on VERINA; 64-round iterative refinement raises proof success to ~20%.",
      "checked_against": "http://34.8.208.118/md/render/paper/verina-benchmarking-verifiable-code-generation/art_ae1f0fb500a44279ba999f4c6df4ed6e (Key Insights and Results section)",
      "status": "supported"
    },
    {
      "claim": "Token-level alignment via DTW over compiler state deltas is proposed to enable localized error correction rather than full proof restart.",
      "checked_against": "http://34.8.208.118/md/render/hypothesis/token-level-alignment-of-informal-mathematics-to-formal-compiler-states-5f6a6e04c0bdbac4 (Core Claim and Synthetic Data sections)",
      "status": "proposal_level"
    }
  ],
  "two_page_synthesis": "## Bridging Informal Mathematics and Formal Verification: Where LLMs Stand and Where They Fail\n\nThe central challenge in automated theorem proving is not mathematical capability per se, but a translation problem with two faces. On one side sits informal mathematics: expressive, compressed, and reliant on shared human context. On the other sits the formal proof assistant\u2014Lean, Coq, Isabelle\u2014which demands exhaustive precision and tolerates no ambiguity. Large language models, trained on vast corpora of informal text, have developed impressive mathematical fluency. But fluency in informal registers does not transfer cleanly to formal ones, and the resulting gap\u2014the autoformalization gap\u2014has become the organizing problem of a rapidly growing subfield.\n\n**The Granularity Mismatch**\n\nThe autoformalization gap is not primarily a problem of vocabulary or syntax. It is a problem of granularity. Human mathematical writing is compressed: a proof sketch omits obvious intermediate steps, implicit type declarations, and structural assumptions that every competent reader can reconstruct. Formal compilers cannot reconstruct anything. Every type must be declared, every implicit structure instantiated. ProofNet (Azerbayev et al., 2023) demonstrates this concretely: models consistently fail when they cannot infer hidden mathematical structures\u2014such as the requirement that an 'orthogonal complement' implies the ambient space is an inner product space\u2014left unstated in the informal prompt but required by Lean's mathlib. Even with retrieval augmentation pointing the model toward relevant formal declarations, the best result on ProofNet's undergraduate-level benchmark was only 16.1% formalization accuracy using OpenAI's Codex. 
This performance ceiling is not merely about model capability; it reflects a structural asymmetry in which the informal prompt and the formal proof inhabit different granularity regimes, and no amount of next-token prediction on informal text automatically teaches a model to bridge them.\n\n**High-Difficulty Baselines: The Reasoning Wall**\n\nIf undergraduate mathematics is hard, competition mathematics is effectively impenetrable. FIMO (Shen et al., 2023) formalizes 149 problems from the IMO Shortlist in Lean, covering algebra and number theory. Even with an iterative feedback pipeline\u2014where GPT-4 receives compiler error messages and corrects its translations\u2014formalization success rates reach only 60.8% (up from 32.6% without feedback). And crucially, a 0% pass rate is observed on actually solving the formalized problems, even when the ground-truth informal proof is supplied as a hint. The model can translate, sometimes; it cannot reason through the formalized problem. This is the reasoning wall: a point at which the syntactic task (producing valid Lean) separates entirely from the semantic task (constructing a valid mathematical argument).\n\nThe FIMO result clarifies what iterative compiler feedback actually buys. Feedback from a Lean compiler is a precise, deterministic signal\u2014a proof either compiles or it does not\u2014and that signal is genuinely useful for surface corrections: mistyped identifiers, wrong argument orders, missing imports. But it carries almost no semantic information about why a mathematical argument is wrong. A model learning to fix a parse error is not a model learning to find a better proof strategy.\n\n**The Iterative Feedback Paradigm: Gains and Limits**\n\nDespite this ceiling, iterative feedback has produced real gains at the competition level. 
StepFun-Prover (Shang et al., 2025) achieves 70.0% pass@1 on miniF2F-test by training a 32B-parameter model via Group Relative Policy Optimization (GRPO) to treat the Lean 4 REPL as an active reasoning tool rather than a final judge. The model submits tactics, receives compiler state transitions, and updates its strategy within the same generation trajectory. Critically, StepFun-Prover-32B outperforms DeepSeek-Prover-V2 at 671B parameters (61.9%), and the 7B StepFun variant still achieves 66.0%. This suggests that the quality of tool integration\u2014how deeply verification is embedded into the reasoning process\u2014matters more than raw parameter count.\n\nBut miniF2F represents high-school competition problems, not research mathematics. VERINA (Ye et al., 2026) extends the evaluation to verifiable code generation, requiring a model to jointly produce code, a formal specification, and a Lean 4 proof of correctness. Here results are sobering: OpenAI's o3 achieves 72.6% on code generation and 52.3% on specification generation, but only 4.9% on proof generation. End-to-end success (code plus spec plus proof from scratch) reaches just 3.2%. Allowing 64 rounds of iterative compiler feedback raises proof success to approximately 20%, but at massive computational cost. On complex problems, models do not converge\u2014they thrash, cycling through different incorrect solutions until the iteration budget is exhausted. More compute does not overcome fundamental gaps in logical reasoning.\n\n**Toward Finer-Grained Alignment**\n\nThe token-level alignment hypothesis (Lacuna, 2025) identifies the underlying credit-assignment failure. When a model submits a tactic block and receives a compiler rejection, it cannot determine which word in its informal reasoning caused the failure. 
The proposed solution draws from multimodal speech processing: just as CTC-based alignment methods map continuous acoustic frames to discrete text tokens, a token-level alignment framework would synchronize individual tokens of an informal sketch to the micro-state transitions of the compiler's interactive loop. Dynamic Time Warping accommodates non-monotonic mappings\u2014one informal phrase may generate multiple formal subgoals; several informal sentences may collapse to a single algebraic simplification call. When the compiler rejects a step, the model could trace the failure back to the specific informal token, enabling localized correction rather than a full proof restart.\n\n**Conclusion**\n\nThe evidence from ProofNet, FIMO, StepFun-Prover, VERINA, and the token-level alignment hypothesis converges on a coherent picture. Iterative compiler feedback is a genuine improvement over one-shot generation, but its utility scales with the model's underlying logical reasoning capacity. For surface-level syntactic errors, it works well. For deep mathematical reasoning failures, more iterations produce more wasted compute, not better proofs. The field is beginning to recognize that what is needed is not more feedback loops but finer-grained alignment\u2014mechanisms that connect informal intent to formal execution at a resolution fine enough to support localized, targeted correction. Whether this is achieved through token-level contrastive alignment, subgoal decomposition, or hybrid neuro-symbolic architectures remains the open question that the next generation of benchmarks\u2014harder than miniF2F, more holistic than ProofNet, broader than VERINA\u2014will need to answer.",
  "final_question": "Under what conditions does iterative compiler feedback enable LLMs to close the autoformalization gap, and at what difficulty level does the feedback signal become semantically insufficient to substitute for deeper mathematical reasoning capability?",
  "observations": [
    "ProofNet shows that even best-case formalization (Codex + retrieval augmentation) achieves only 16.1% accuracy on undergraduate-level mathematics, with failure rooted in the model's inability to infer implicit mathematical structures\u2014such as inner product space declarations\u2014that are required by formal libraries but left unstated in informal problem prompts. [Source: http://34.8.208.118/md/render/paper/proofnet-autoformalizing-and-formally-proving-undergraduate-level-mathematics/art_f5a5f3551f6641598e578328d5771b3b]",
    "StepFun-Prover achieves 70.0% pass@1 on miniF2F-test with a 32B model\u2014outperforming a 671B competitor\u2014by using GRPO reinforcement learning to embed Lean 4 REPL interactions directly into the reasoning trajectory, demonstrating that depth of tool integration is a stronger performance driver than parameter count. [Source: http://34.8.208.118/md/render/paper/stepfun-prover-preview-let-s-think-and-verify-step-by-step/art_804f191b837940a1a2b73568d30d29b0]",
    "VERINA reveals that even with 64 rounds of iterative compiler feedback, the best model (o3) achieves only ~20% proof generation success and 3.2% end-to-end verifiable code generation, with models thrashing rather than converging on complex problems\u2014quantifying the hard ceiling of feedback-loop approaches when fundamental logical reasoning gaps exist. [Source: http://34.8.208.118/md/render/paper/verina-benchmarking-verifiable-code-generation/art_ae1f0fb500a44279ba999f4c6df4ed6e]"
  ],
  "limitation": "Iterative compiler feedback\u2014even at 64 rounds\u2014fails to overcome fundamental logical reasoning gaps: VERINA documents models cycling through different incorrect proofs until the iteration budget is exhausted, with performance plateauing at ~20% rather than converging. The compiler's error signals are precise but semantically impoverished; they report what failed syntactically but provide no information about why a mathematical argument is wrong, making the feedback loop useful for surface corrections but insufficient for genuine mathematical insight. FIMO's 0% solve rate at the IMO level, achieved even when ground-truth informal proofs are supplied, confirms that this is a reasoning ceiling, not merely a feedback-design problem. [Sources: VERINA (http://34.8.208.118/md/render/paper/verina-benchmarking-verifiable-code-generation/art_ae1f0fb500a44279ba999f4c6df4ed6e); FIMO (http://34.8.208.118/md/render/paper/fimo-a-challenge-formal-dataset-for-automated-theorem-proving/art_d65417c966cc4525ab8bd63248394361); Iterative Compiler Feedback direction (http://34.8.208.118/md/render/direction/iterative-compiler-feedback-for-formal-theorem-proving-12169)]",
  "blocking_model_calls_needed": 1,
  "notes": "All 7 required pages were reachable and returned substantive markdown content. No redirects or 404s encountered. The search start URL was bypassed in favor of the known direction URL per task instructions. Pages were fetched in parallel (single Bash invocation with 7 concurrent curl calls) and bounded at 30000 characters each; all pages fit within that limit. The token-level alignment page is classified as a hypothesis/proposal rather than a published paper\u2014all claims from it are marked proposal_level in source_checks. The two-page synthesis is approximately 820 words."
}
```
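
A result object of this shape can be checked mechanically for internal consistency. The sketch below assumes the field names seen in the outputs in this report (`route`, `pages_read_count`, `source_checks`, `observations`). The `check_result` function and the `KNOWN_STATUSES` set are hypothetical and are not part of the evaluation harness.

```python
from typing import Any, Dict, List

# Status values observed in the outputs in this report; treat as an assumption.
KNOWN_STATUSES = {"supported", "proposal_level"}


def check_result(result: Dict[str, Any]) -> List[str]:
    """Return a list of consistency problems found in one condition's output."""
    problems: List[str] = []

    # The reported page count should equal the number of route entries.
    route = result.get("route", [])
    if result.get("pages_read_count") != len(route):
        problems.append("pages_read_count does not match len(route)")

    # Every source check should carry a recognised status.
    for i, check in enumerate(result.get("source_checks", [])):
        status = check.get("status")
        if status not in KNOWN_STATUSES:
            problems.append(f"source_checks[{i}] has unknown status {status!r}")

    # Every observation should cite at least one source.
    for i, observation in enumerate(result.get("observations", [])):
        if "[Source" not in observation:
            problems.append(f"observations[{i}] has no source citation")

    return problems
```

An empty list means the object passed these three checks. It says nothing about whether the cited claims are actually correct.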

### pdf_to_chat_baseline

```json
{
  "condition": "pdf_to_chat_baseline",
  "route": [
    {
      "url": "https://arxiv.org/pdf/2302.12433",
      "reason": "ProofNet establishes the benchmark landscape for autoformalization and formal proving at undergraduate level, identifying the informal-to-formal translation gap as the field's central infrastructure problem."
    },
    {
      "url": "https://arxiv.org/pdf/2309.04295",
      "reason": "FIMO escalates difficulty to IMO Shortlisted Problems, confirming that GPT-4 cannot prove any formally stated Olympiad problems and revealing the formalization pipeline's dependence on human semantic verification."
    },
    {
      "url": "https://stepfun.ai/research/en/stepfun-prover-preview",
      "reason": "StepFun-Prover demonstrates that training LLMs via RL to conduct open-ended Lean 4 REPL interactions \u2014 rather than receiving verifier feedback only at test time \u2014 achieves new state-of-the-art on miniF2F, establishing verifier-in-the-loop RL as the dominant paradigm shift."
    },
    {
      "url": "https://arxiv.org/pdf/2505.23135",
      "reason": "VERINA extends evaluation to verifiable software code generation, showing proof generation remains critically limited (4.9% single-trial success for o3; 20.1% with 64 iterative refinements), and proving that the formal gap at software-verification difficulty is a distinct, harder challenge than math competition proving."
    }
  ],
  "pages_read_count": 4,
  "paper_summaries": [
    {
      "paper": "ProofNet (arXiv:2302.12433)",
      "summary": "Introduces ProofNet, a 371-example benchmark for autoformalization and formal proving of undergraduate-level mathematics in Lean 3, covering real and complex analysis, linear algebra, abstract algebra, and topology. Also releases PROOF GPT models (1.3B and 6.7B params) trained on an 8B-token proof-pile. Proposes two novel autoformalization methods: prompt retrieval (nearest-neighbor search over mathlib declarations) and distilled backtranslation (unsupervised finetuning without parallel data). Central finding: lack of parallel informal-formal benchmarks was blocking progress; ProofNet fills this gap but baseline models remain far from human performance.",
      "decision": "use"
    },
    {
      "paper": "FIMO (arXiv:2309.04295)",
      "summary": "Presents FIMO, a dataset of 149 human-verified Lean formal statements sourced from IMO Shortlisted Problems (2006\u20132021), focused on Algebra and Number Theory. Constructs the dataset via a three-stage pipeline: OCR (Mathpix) \u2192 GPT-4 auto-formalization with iterative Lean error feedback (up to 5 reflection rounds) \u2192 human semantic verification. Key finding: GPT-4 fails to prove any IMO-level formal statements, confirming that competition-level mathematical reasoning is qualitatively beyond current LLMs. The feedback-augmented auto-formalization loop is promising but insufficient at this difficulty, and human verification remains unavoidable.",
      "decision": "use"
    },
    {
      "paper": "StepFun-Prover Preview (stepfun.ai/research/en/stepfun-prover-preview)",
      "summary": "Releases StepFun-Prover-Preview-7B and 32B \u2014 Lean 4 theorem provers trained via tool-integrated reinforcement learning. The training pipeline proceeds through cold-start data curation (multi-turn trajectories collected with Claude Sonnet 4 plus Kimina-Prover-72B outputs), two-stage SFT, response pattern fusion, and tool-integrated GRPO RL in which the model learns to decide when to invoke the Lean 4 REPL sandbox, interpret error messages, and adaptively restructure proofs without a fixed interaction limit. StepFun-Prover-Preview-32B achieves 70.0% pass@1 on miniF2F-test, outperforming DeepSeek-Prover-V2-671B (61.9%) and Kimina-Prover-72B (63.9%) with a much smaller model. Test-time scaling is monotonic: pass@1 rises from 58.3% at 4K to 70.0% at 20K maximum generation length, confirming verifier interaction as the core proof mechanism.",
      "decision": "use"
    },
    {
      "paper": "VERINA (arXiv:2505.23135, ICLR 2026)",
      "summary": "Introduces VERINA (Verifiable Code Generation Arena), a 189-task Lean benchmark for holistic evaluation of verifiable code generation across three compositional subtasks: code generation, specification generation (pre/post-conditions), and proof generation. Manually curated from MBPP-DFY-50, CloverBench, and university course submissions. Best general-purpose model (OpenAI o3): 72.6% code correctness, 52.3% spec soundness/completeness, 4.9% proof success (single trial). Best theorem-proving specialist (Goedel Prover V2 32B): 11.2% proof success. Iterative Lean compiler feedback raises proof success to 20.1% with 64 refinement steps, but at high computational cost and with 80% of proofs still failing. Identifies proof generation \u2014 not code or specification generation \u2014 as the binding bottleneck in the verifiable code pipeline.",
      "decision": "use"
    }
  ],
  "source_checks": [
    {
      "claim": "StepFun-Prover-Preview-32B achieves 70.0% pass@1 on miniF2F-test, surpassing DeepSeek-Prover-V2-671B (61.9%) and Kimina-Prover-72B (63.9%)",
      "checked_against": "StepFun-Prover source article Table 1 (Performance comparison on miniF2F-test)",
      "status": "supported"
    },
    {
      "claim": "OpenAI o3 achieves only 4.9% proof success rate (single trial) on VERINA; iterative refinement with 64 steps raises this to 20.1%",
      "checked_against": "VERINA abstract and Section 1 introduction",
      "status": "supported"
    },
    {
      "claim": "ProofNet contains 371 parallel formal/informal theorem pairs in Lean 3 drawn from undergraduate pure mathematics textbooks",
      "checked_against": "ProofNet abstract and Section 2 dataset collection",
      "status": "supported"
    },
    {
      "claim": "GPT-4 (gpt-4-0314) cannot prove any IMO-level formal statements in the FIMO benchmark",
      "checked_against": "FIMO abstract ('highlights GPT-4's limited capacity to yield satisfactory results') and experimental discussion",
      "status": "supported"
    },
    {
      "claim": "Test-time scaling in StepFun-Prover is monotonic: pass@1 increases from 58.3% (4K tokens) to 70.0% (20K tokens)",
      "checked_against": "StepFun-Prover article Table 2 (Performance with various maximum generation lengths)",
      "status": "supported"
    },
    {
      "claim": "Autoformalization of formal mathematical statements still requires human semantic verification even with LLM-plus-Lean-feedback loops",
      "checked_against": "FIMO dataset construction section (Manual Verification stage) and ProofNet dataset collection criteria",
      "status": "supported"
    }
  ],
  "two_page_synthesis": "The four sources trace a coherent arc from benchmark construction through state-of-the-art proof generation to software verification, revealing why the formal reasoning gap between human mathematicians and automated theorem provers remains large \u2014 and pointing to one structural intervention, tight iterative coupling between LLM generation and formal verifier feedback, as the key lever for progress.\n\nProofNet (2023) identified the core infrastructure problem: no parallel benchmark aligned informal undergraduate mathematics with formal Lean 3 statements, making progress in autoformalization essentially unmeasurable. Its 371-example dataset, spanning real and complex analysis, linear algebra, abstract algebra, and topology, exposed how difficult it is to bridge informal mathematical intuition and formal rigor even at the undergraduate level. PROOF GPT models, trained on an 8B-token proof-pile, outperformed base models on perplexity metrics but the paper's two novel autoformalization techniques \u2014 prompt retrieval via nearest-neighbor search over mathlib declarations, and distilled backtranslation \u2014 represent incremental gains on the translation subtask, not yet a path to full end-to-end proving.\n\nFIMO (2023) escalated the difficulty to genuine Olympiad problems and found a steep cliff. Sourcing 149 problems from IMO Shortlisted Problems and formalizing them through a three-stage pipeline (OCR, GPT-4 auto-formalization with iterative Lean error feedback, human semantic verification), the authors showed that GPT-4 could produce syntactically valid Lean statements but could not prove a single one. The critical design choice \u2014 exposing the model to Lean error messages and allowing iterative reflection \u2014 is the same pattern that would later define the StepFun-Prover training regime, but at IMO difficulty it cannot eliminate the need for human semantic validation of the formal statements themselves. 
This reveals a two-layer problem: autoformalization correctness and proof search capability are separate failure modes.\n\nStepFun-Prover (2025) operationalizes the verifier-feedback loop at training time rather than inference time, and the results are dramatic. The training pipeline begins with cold-start SFT data including multi-turn trajectories collected with Claude Sonnet 4, followed by two-stage fine-tuning, response pattern fusion to reconcile different reasoning styles, and finally group relative policy optimization (GRPO) in which models learn to decide when to invoke the Lean 4 REPL sandbox, how to interpret error and warning messages, and when to restructure the proof entirely. Crucially, there is no fixed limit on the number of REPL interactions: the model learns to self-terminate when it believes the proof is complete. StepFun-Prover-Preview-32B achieves 70.0% pass@1 on miniF2F-test, outperforming DeepSeek-Prover-V2-671B (61.9%) and Kimina-Prover-72B (63.9%) with a far smaller model. Test-time scaling is monotonic: extending maximum generation length from 4K to 20K tokens raises pass@1 from 58.3% to 70.0%, and the REPL interaction frequency distribution for successful proofs confirms that many solutions emerge only after multiple rounds of error diagnosis and adaptive restructuring. The verifier feedback loop is not a fallback mechanism \u2014 it is the primary proof generation process.\n\nVERINA (2026) applies this framing to software verification and reveals a harder bottleneck. The benchmark's 189 manually-curated Lean tasks require jointly generating code, formal pre/post-condition specifications, and correctness proofs \u2014 a more compositional challenge than pure math proving, and one that demands precise semantic alignment between three formal artifacts rather than one. Even with Lean compiler feedback available, the best general-purpose model (OpenAI o3) achieves only 4.9% proof success in a single trial. 
The best specialist theorem prover (Goedel Prover V2 32B) reaches 11.2%. Iterative refinement over 64 steps raises proof success to 20.1% \u2014 meaningful progress, but leaving 80% of tasks unsolved and incurring substantial computational cost. Critically, code correctness (72.6%) and specification soundness (52.3%) are far higher than proof success, confirming that proof construction \u2014 not code generation or informal-to-formal translation \u2014 is the binding constraint in the verifiable software pipeline.\n\nThe route reveals a field in transition. The dominant paradigm \u2014 generate formal proofs, check them with an ITP, and learn from the feedback signal \u2014 has been validated at the miniF2F competition level and is now being pushed toward software verification and genuine Olympiad difficulty. The trajectory from 0% (GPT-4 on FIMO IMO problems) to 70% (StepFun-Prover on miniF2F) and to 4\u201320% (best models on VERINA proofs) exposes both the power and the limits of this paradigm. The informal-to-formal translation bottleneck (autoformalization) identified by ProofNet and FIMO remains present even in 2025\u20132026 systems: VERINA's manually curated ground-truth formal specifications are required because LLMs cannot yet reliably generate faithful specifications without human intervention. Progress on proof search (via RL from verifier feedback) has outpaced progress on autoformalization, creating an asymmetry that will need to be addressed for truly end-to-end automated theorem proving.",
  "final_question": "To what extent does reinforcement learning from proof-assistant verifier feedback (Lean REPL interactions) generalize across difficulty regimes \u2014 from undergraduate-level and competition-level math to software verification \u2014 and does the informal-to-formal autoformalization step constitute an independent bottleneck that limits end-to-end automated theorem proving regardless of proof search capability?",
  "observations": [
    "Iterative verifier-feedback loops during RL training \u2014 not just at inference time \u2014 yield qualitative performance gains: StepFun-Prover-Preview-32B achieves 70.0% pass@1 on miniF2F-test by learning to conduct open-ended multi-turn Lean 4 REPL interactions, outperforming models 20\u00d7 its size (DeepSeek-Prover-V2-671B at 61.9%), with test-time scaling showing monotonic improvement from 58.3% at 4K to 70.0% at 20K generation tokens. [Source: StepFun-Prover article Table 1 and Table 2]",
    "Proof generation is the binding bottleneck in verifiable code pipelines, not code or specification generation: VERINA shows that even OpenAI o3 achieves 72.6% code correctness but only 4.9% proof success in a single trial, and iterative Lean compiler feedback over 64 steps improves this to only 20.1% \u2014 confirming that formal reasoning about code correctness remains qualitatively harder than generating the code itself. [Source: VERINA abstract and Section 1]",
    "The informal-to-formal autoformalization gap is a persistent, difficulty-dependent bottleneck that operates independently of proof search capability: ProofNet (2023) introduced prompt retrieval and distilled backtranslation to bridge the gap at undergraduate level, while FIMO (2023) showed that even GPT-4 with iterative Lean feedback cannot autoformalize and prove any IMO Shortlisted Problems without human semantic verification \u2014 implying that as benchmark difficulty rises, autoformalization failure precedes and compounds proof search failure. [Sources: ProofNet Section 4.1, FIMO abstract and dataset construction section]"
  ],
  "limitation": "Cross-benchmark evaluation is essentially absent in the current literature, making it impossible to attribute observed performance gaps to specific failure modes. StepFun-Prover achieves 70% on miniF2F but its performance on FIMO IMO-level tasks or VERINA software-verification proofs is unreported; VERINA evaluates Goedel Prover V2 but not StepFun-Prover or the models from ProofNet and FIMO. The steep drop between miniF2F pass rates (70%) and VERINA proof rates (4.9\u201311.2%) is inferred by comparing different models on different benchmarks designed by different teams, not from controlled ablations holding the model fixed. Without a unified evaluation across difficulty regimes \u2014 undergraduate autoformalization, competition proving, and software verification proofs \u2014 it is unclear whether the verifier-feedback RL paradigm represents a general capability gain or a benchmark-specific one, and whether improvements in proof search (as in StepFun-Prover) would transfer to the harder autoformalization-plus-proving pipeline that FIMO and VERINA require.",
  "blocking_model_calls_needed": 6,
  "notes": "All four source papers were successfully fetched and processed. ProofNet and FIMO PDFs were extracted with pdftotext (first 10 pages each). The StepFun-Prover page is a Next.js app that renders content client-side; the article text was recoverable via HTML tag stripping (Python regex) from the raw server-rendered HTML, yielding sufficient detail including the methodology, training pipeline, experimental tables, and full reference list. VERINA PDF was extracted with pdftotext. No sources were unavailable. VERINA is published as a conference paper at ICLR 2026 (per the PDF header), which is a future date relative to the knowledge cutoff \u2014 the arXiv preprint (v3, 2026-03-16) was used as the source. The blocking_model_calls_needed estimate of 6 reflects: 1 planning call + 1 call per paper summary (4 calls) + 1 final synthesis call, which is the minimum sequential workflow for a PDF-to-chat baseline without parallelization."
}
```
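
The `blocking_model_calls_needed` value of 6 follows the breakdown stated in the `notes` field: one planning call, one summary call per paper (four papers), and one final synthesis call, all run sequentially. A minimal Python sketch of that arithmetic is below. The function and parameter names are hypothetical and are not part of the evaluation harness; the sketch only assumes the sequential, non-parallel workflow that the notes describe.

```python
def estimate_blocking_calls(
    num_papers: int,
    planning_calls: int = 1,
    synthesis_calls: int = 1,
) -> int:
    """Estimate sequential (blocking) model calls for a PDF-to-chat workflow.

    Assumes one summary call per paper and no parallelization, matching the
    breakdown given in the run's `notes` field. Names are illustrative only.
    """
    if num_papers < 0 or planning_calls < 0 or synthesis_calls < 0:
        raise ValueError("call counts must be non-negative")
    return planning_calls + num_papers + synthesis_calls


# Four source papers: ProofNet, FIMO, StepFun-Prover, VERINA.
baseline_calls = estimate_blocking_calls(num_papers=4)
print(baseline_calls)
```

Under this model the estimate grows linearly with the number of papers, so it should be read as a lower bound for a strictly sequential baseline rather than a measured count.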
