# Claude Code Workflow Evaluation

Run directory: `outputs/2026-05-08`

This is an operational trace, not a human-subjects study. Both conditions ran Claude Code in non-interactive mode with the same model and the same fixed task.

## Summary Table

| Condition | Wall time | Claude turns | Tool calls | Estimated blocking model calls | Cost | Outcome (final question, truncated) |
|---|---:|---:|---:|---:|---:|---|
| `lacuna_navigation` | 55.7s | 7 | 6 | 1 | $0.139 | Can token-level alignment between informal mathematical language and formal compiler goal states—trained on synthetic... |
| `pdf_to_chat_baseline` | 100.5s | 8 | 7 | 6 | $0.277 | Across Lean-based benchmarks of increasing difficulty — from undergraduate mathematics (ProofNet) to IMO competition ... |
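
To make the gap between the two rows easier to read, the sketch below divides each baseline metric by its `lacuna_navigation` counterpart. The numbers are copied from the table; the dictionary layout and the `baseline_ratios` helper are illustrative assumptions, not part of the evaluation harness. The table reports one row per condition, so the ratios are descriptive only.

```python
# Values copied from the Summary Table; the dictionary layout is an assumption.
runs = {
    "lacuna_navigation": {
        "wall_s": 55.7, "turns": 7, "tool_calls": 6,
        "blocking_calls": 1, "cost_usd": 0.139,
    },
    "pdf_to_chat_baseline": {
        "wall_s": 100.5, "turns": 8, "tool_calls": 7,
        "blocking_calls": 6, "cost_usd": 0.277,
    },
}


def baseline_ratios(runs):
    """Return baseline / lacuna for every metric in the table."""
    lacuna = runs["lacuna_navigation"]
    baseline = runs["pdf_to_chat_baseline"]
    return {metric: baseline[metric] / lacuna[metric] for metric in lacuna}


ratios = baseline_ratios(runs)
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.2f}x")
```

On these figures the baseline took roughly 1.8 times the wall time and roughly twice the cost, while turn and tool-call counts differed by only one each.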

## Tool Commands

### lacuna_navigation

- `curl -sS "http://34.8.208.118/md/render/direction/the-autoformalization-gap-in-theorem-proving-23293" | head -c 20000`
- `curl -sS "http://34.8.208.118/md/render/direction/iterative-compiler-feedback-for-formal-theorem-proving-12169" | head -c 20000`
- `curl -sS "http://34.8.208.118/md/render/paper/proofnet-autoformalizing-and-formally-proving-undergraduate-level-mathematics/art_f5a5f3551f6641598e578328d5771b3b" | head -c 10000`
- `curl -sS "http://34.8.208.118/md/render/paper/stepfun-prover-preview-let-s-think-and-verify-step-by-step/art_804f191b837940a1a2b73568d30d29b0" | head -c 10000`
- `curl -sS "http://34.8.208.118/md/render/paper/verina-benchmarking-verifiable-code-generation/art_ae1f0fb500a44279ba999f4c6df4ed6e" | head -c 10000`
- `curl -sS "http://34.8.208.118/md/render/hypothesis/token-level-alignment-of-informal-mathematics-to-formal-compiler-states-5f6a6e04c0bdbac4" | head -c 10000`

### pdf_to_chat_baseline

- `curl -sS "https://arxiv.org/abs/2302.12433" | head -c 30000`
- `curl -sS "https://arxiv.org/abs/2309.04295" | head -c 30000`
- `curl -sS "https://stepfun.ai/research/en/stepfun-prover-preview" | head -c 30000`
- `curl -sS "https://stepfun.ai/research/en/stepfun-prover-preview" | head -c 30000 | grep -o '"description"[^,}]*' | head -5; curl -sS "https://stepfun.ai/research/en/stepfun-prover-preview" | head -c 30000 | grep -o 'content="[^"]*[Pp]rover[^"]*"' | head -5`
- `curl -sS "https://stepfun.ai/research/en/stepfun-prover-preview" 2>/dev/null | grep -oP 'content="[^"]{40,}"' | head -20`
- `curl -sS "https://stepfun.ai/research/en/stepfun-prover-preview" 2>/dev/null | grep -oE 'content="[^"]{40,}"' | head -20`
- `curl -sS "https://arxiv.org/abs/2505.23135" | head -c 30000`
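
The last StepFun commands above filter the page for long `content="..."` attributes. A rough Python equivalent of that extraction step is sketched below, using the same pattern as the `grep -oE` call. The function name `long_content_attrs` and the sample HTML are hypothetical; the sample is not the actual response from stepfun.ai.

```python
import re

# Same pattern as the grep -oE command: a content="..." attribute
# whose value is at least 40 characters long.
# Note: grep matches line by line, while this regex may span newlines.
CONTENT_ATTR = re.compile(r'content="[^"]{40,}"')


def long_content_attrs(html, limit=20):
    """Return up to `limit` matches, mirroring `grep -o ... | head -20`."""
    return CONTENT_ATTR.findall(html)[:limit]


# Hypothetical HTML for illustration only.
sample_html = (
    '<meta name="viewport" content="width=device-width">'
    '<meta name="description" '
    'content="An example description that is longer than forty characters.">'
)

matches = long_content_attrs(sample_html)
print(matches)
```

The short `viewport` value is skipped and only the long description attribute is returned, which is the behavior the trace relied on to recover a meta description from a page whose body was not otherwise accessible.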

## Final Outputs

### lacuna_navigation

```json
{
  "condition": "lacuna_navigation",
  "route": [
    {
      "url": "http://34.8.208.118/md/render/direction/the-autoformalization-gap-in-theorem-proving-23293",
      "reason": "Entry point establishing the core tension: LLMs have strategic intuition but formal compilers demand absolute precision; introduces granularity mismatch and semantic misalignment concepts"
    },
    {
      "url": "http://34.8.208.118/md/render/direction/iterative-compiler-feedback-for-formal-theorem-proving-12169",
      "reason": "Deepens understanding of closed-loop REPL-based approaches and identifies the key limitation: iterative feedback fails when the underlying logical reasoning gap is too large"
    },
    {
      "url": "http://34.8.208.118/md/render/paper/proofnet-autoformalizing-and-formally-proving-undergraduate-level-mathematics/art_f5a5f3551f6641598e578328d5771b3b",
      "reason": "Concrete benchmark evidence: best formalization accuracy at 16.1% with retrieval augmentation; documents the implicit-hypothesis failure mode"
    },
    {
      "url": "http://34.8.208.118/md/render/paper/stepfun-prover-preview-let-s-think-and-verify-step-by-step/art_804f191b837940a1a2b73568d30d29b0",
      "reason": "State-of-the-art REPL-integrated RL result (70.0% pass@1 on miniF2F); shows multi-turn agentic interaction outperforms larger one-shot models"
    },
    {
      "url": "http://34.8.208.118/md/render/paper/verina-benchmarking-verifiable-code-generation/art_ae1f0fb500a44279ba999f4c6df4ed6e",
      "reason": "Provides the sharpest quantification of the proof-generation bottleneck: o3 achieves 72.6% CodeGen but only 4.9% ProofGen; iterative refinement plateaus at ~20%"
    },
    {
      "url": "http://34.8.208.118/md/render/hypothesis/token-level-alignment-of-informal-mathematics-to-formal-compiler-states-5f6a6e04c0bdbac4",
      "reason": "Research hypothesis proposing token-level alignment via DTW/CTC techniques to solve the credit-assignment problem during iterative compiler feedback"
    }
  ],
  "branch_decisions": [
    "Skipped FIMO paper (art_d65417c966cc4525ab8bd63248394361) after it was sufficiently summarized in the hypothesis page; VERINA and ProofNet provided stronger quantitative grounding",
    "Did not follow author or search links from direction pages; stayed on the recommended route to keep scope tight",
    "Used /md variants throughout for clean plain-text parsing rather than HTML search results"
  ],
  "source_checks": [
    {
      "claim": "ProofNet: best formalization accuracy with retrieval augmentation was 16.1% (Codex + Prompt Retrieval), up from 13.4% baseline; typecheck rate jumped from 23.7% to 45.2%",
      "checked_against": "http://34.8.208.118/md/render/paper/proofnet-autoformalizing-and-formally-proving-undergraduate-level-mathematics/art_f5a5f3551f6641598e578328d5771b3b",
      "status": "supported"
    },
    {
      "claim": "StepFun-Prover-Preview-32B achieves 70.0% pass@1 on miniF2F-test, outperforming DeepSeek-Prover-V2-671B (61.9%) using tool-integrated RL with Lean 4 REPL",
      "checked_against": "http://34.8.208.118/md/render/paper/stepfun-prover-preview-let-s-think-and-verify-step-by-step/art_804f191b837940a1a2b73568d30d29b0",
      "status": "supported"
    },
    {
      "claim": "VERINA: o3 scores 72.6% CodeGen, 52.3% SpecGen, but only 4.9% ProofGen; end-to-end (code+spec+proof) success rate drops to 3.2%; 64 rounds of iterative feedback raises proof success to ~20% but hits a wall",
      "checked_against": "http://34.8.208.118/md/render/paper/verina-benchmarking-verifiable-code-generation/art_ae1f0fb500a44279ba999f4c6df4ed6e",
      "status": "supported"
    },
    {
      "claim": "Token-level DTW alignment of informal math tokens to formal compiler goal-state deltas can solve the credit-assignment problem in iterative proving",
      "checked_against": "http://34.8.208.118/md/render/hypothesis/token-level-alignment-of-informal-mathematics-to-formal-compiler-states-5f6a6e04c0bdbac4",
      "status": "proposal_level"
    },
    {
      "claim": "Models fail autoformalization because they cannot infer hidden mathematical structures (e.g., inner product spaces) left implicit in informal prompts",
      "checked_against": "http://34.8.208.118/md/render/direction/the-autoformalization-gap-in-theorem-proving-23293 (citing ProofNet)",
      "status": "supported"
    }
  ],
  "final_question": "Can token-level alignment between informal mathematical language and formal compiler goal states\u2014trained on synthetic proof-execution traces\u2014reduce the iterative-refinement ceiling in LLM-based theorem provers beyond the ~20% plateau observed on complex verifiable code generation tasks?",
  "observations": [
    "Retrieval-augmented autoformalization (Codex + Prompt Retrieval) improves Lean typecheck rates from 23.7% to 45.2% and formalization accuracy from 13.4% to 16.1% on ProofNet, but overall accuracy remains low because models cannot infer implicit mathematical structures (e.g., inner product space declarations) required by the formal library. [ProofNet](http://34.8.208.118/md/render/paper/proofnet-autoformalizing-and-formally-proving-undergraduate-level-mathematics/art_f5a5f3551f6641598e578328d5771b3b)",
    "Tool-integrated reinforcement learning with real-time Lean 4 REPL feedback (StepFun-Prover-Preview-32B) achieves 70.0% pass@1 on miniF2F-test\u2014surpassing models with 20\u00d7 more parameters\u2014demonstrating that quality of agentic interaction with the formal verifier is a stronger performance driver than raw model scale. [StepFun-Prover](http://34.8.208.118/md/render/paper/stepfun-prover-preview-let-s-think-and-verify-step-by-step/art_804f191b837940a1a2b73568d30d29b0)",
    "On VERINA's end-to-end verifiable code generation benchmark (Lean 4), the best model (o3) achieves only 3.2% full-pipeline success and 4.9% ProofGen in isolation; 64 rounds of iterative compiler feedback raises proof success to only ~20%, after which additional compute yields diminishing returns, indicating that iterative feedback alone cannot compensate for fundamental logical reasoning gaps. [VERINA](http://34.8.208.118/md/render/paper/verina-benchmarking-verifiable-code-generation/art_ae1f0fb500a44279ba999f4c6df4ed6e)"
  ],
  "limitation": "Iterative compiler feedback\u2014the dominant closed-loop paradigm\u2014suffers a hard ceiling on complex problems: when a model lacks the underlying deductive capability to bridge a logical gap, no amount of REPL iterations resolves it. The compiler's error messages are designed for rigorous verification, not pedagogy, so they provide technically precise but semantically opaque signals. Models enter thrashing loops, cycling through incorrect proof variants until hitting iteration limits, rather than converging. This is documented quantitatively in VERINA (plateau at ~20% with 64 feedback rounds) and qualitatively in the iterative-compiler-feedback direction page (citing VERINA and the Tsumura problem failures).",
  "blocking_model_calls_needed": 1,
  "notes": "All six recommended pages were successfully fetched and parsed. The hypothesis page is a Lacuna-native research proposal (not an external paper), so its core claim is marked proposal_level. The FIMO paper page was referenced by the hypothesis but not independently fetched; its existence is confirmed via cross-references. Route converges on a clear research gap: the credit-assignment problem during iterative refinement is the proximate cause of the ~20% ceiling, and token-level alignment (hypothesis page) is the most concrete proposed solution found on these Lacuna pages."
}
```
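
A quick way to read the `source_checks` array is to tally its `status` values. The sketch below does this on an abridged copy of the output above, keeping only the `status` field of each check in the original order. The helper name `tally_statuses` is an assumption; in practice the full JSON could be loaded from wherever the run stores it.

```python
import json
from collections import Counter

# Abridged copy of the output above: only the "status" field
# of each source check is kept, in the original order.
abridged_output = json.loads("""
{
  "condition": "lacuna_navigation",
  "source_checks": [
    {"status": "supported"},
    {"status": "supported"},
    {"status": "supported"},
    {"status": "proposal_level"},
    {"status": "supported"}
  ]
}
""")


def tally_statuses(output):
    """Count how many source checks carry each status label."""
    return Counter(check["status"] for check in output["source_checks"])


status_counts = tally_statuses(abridged_output)
print(dict(status_counts))
```

For this condition the tally is four `supported` claims and one `proposal_level` claim, the latter being the hypothesis page.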

### pdf_to_chat_baseline

```json
{
  "condition": "pdf_to_chat_baseline",
  "route": [
    {
      "url": "https://arxiv.org/abs/2302.12433",
      "reason": "ProofNet establishes a Lean-based undergraduate mathematics benchmark (371 examples) with autoformalization baselines \u2014 anchors the lower difficulty tier of the landscape."
    },
    {
      "url": "https://arxiv.org/abs/2309.04295",
      "reason": "FIMO escalates to IMO-level difficulty (149 Lean problems) and shows GPT-4 fails substantially, revealing the capability gap between undergraduate and competition-level ATP."
    },
    {
      "url": "https://stepfun.ai/research/en/stepfun-prover-preview",
      "reason": "StepFun-Prover Preview represents an industry-deployed LLM built specifically for formal theorem proving with tool-integrated reasoning \u2014 intended to illustrate the state of the art on the systems side."
    },
    {
      "url": "https://arxiv.org/abs/2505.23135",
      "reason": "VERINA extends the benchmark lens to verifiable code generation in Lean, quantifying that even the best LLM (OpenAI o3) achieves only 4.9% proof success, making proof generation the dominant bottleneck."
    }
  ],
  "paper_summaries": [
    {
      "paper": "ProofNet (arXiv:2302.12433)",
      "summary": "Introduces a benchmark of 371 Lean 3 formal theorem-proving examples drawn from undergraduate pure mathematics textbooks (real/complex analysis, linear algebra, abstract algebra, topology). Each example includes a formal Lean 3 statement, a natural-language statement, and a natural-language proof. Reports in-context learning baselines for autoformalization and proposes two new methods \u2014 prompt retrieval and distilled backtranslation \u2014 as improvements. Intended as a driver for autoformalization and ATP research.",
      "decision": "use"
    },
    {
      "paper": "FIMO (arXiv:2309.04295)",
      "summary": "Presents 149 Lean-formalized problem statements from IMO Shortlisted Problems, paired with informal descriptions and LaTeX proofs. Targets a much harder difficulty tier than ProofNet. Initial GPT-4 experiments show the model largely fails to produce valid Lean proofs at this level, with the authors concluding there is a substantial gap before satisfactory IMO-level ATP is achievable.",
      "decision": "use"
    },
    {
      "paper": "StepFun-Prover Preview (stepfun.ai)",
      "summary": "A commercial/research preview of a large language model from StepFun AI designed specifically for formal theorem proving using 'tool-integrated reasoning'. The research page is a client-side-rendered JavaScript SPA (Next.js); only the meta description was accessible via curl \u2014 no benchmark numbers, methodology details, or comparisons could be extracted from the raw HTML.",
      "decision": "use"
    },
    {
      "paper": "VERINA (arXiv:2505.23135)",
      "summary": "Introduces a benchmark of 189 manually curated Lean coding tasks targeting verifiable code generation \u2014 jointly evaluating code, formal specification, and proof generation. Evaluated state-of-the-art LLMs including OpenAI o3 (best performer): 72.6% code correctness, 52.3% specification soundness/completeness, but only 4.9% proof success rate (one trial per task). Concludes that proof generation is the critical bottleneck for LLM-based formal verification.",
      "decision": "use"
    }
  ],
  "source_checks": [
    {
      "claim": "ProofNet contains 371 examples at undergraduate level in Lean 3",
      "checked_against": "https://arxiv.org/abs/2302.12433 abstract",
      "status": "supported"
    },
    {
      "claim": "FIMO contains 149 IMO-level Lean problems and GPT-4 underperforms significantly",
      "checked_against": "https://arxiv.org/abs/2309.04295 abstract",
      "status": "supported"
    },
    {
      "claim": "StepFun-Prover uses tool-integrated reasoning for formal theorem proving",
      "checked_against": "https://stepfun.ai/research/en/stepfun-prover-preview meta description only (SPA, body inaccessible)",
      "status": "proposal_level"
    },
    {
      "claim": "Best LLM achieves only 4.9% proof success on VERINA benchmark",
      "checked_against": "https://arxiv.org/abs/2505.23135 abstract (OpenAI o3, one trial per task)",
      "status": "supported"
    }
  ],
  "final_question": "Across Lean-based benchmarks of increasing difficulty \u2014 from undergraduate mathematics (ProofNet) to IMO competition problems (FIMO) to verifiable code generation (VERINA) \u2014 what explains the disproportionate failure of LLMs specifically at proof generation compared to code or specification generation, and can tool-integrated reasoning approaches (e.g., StepFun-Prover) close this gap?",
  "observations": [
    "Benchmarks now span a well-defined difficulty ladder for Lean-based ATP: ProofNet (371 undergraduate problems, 2023) anchors the lower end, FIMO (149 IMO problems, 2023) establishes a hard middle tier, and VERINA (189 verification tasks, 2025) captures applied software-verification demands \u2014 each revealing progressively worse LLM proof success rates. [ProofNet: https://arxiv.org/abs/2302.12433; FIMO: https://arxiv.org/abs/2309.04295; VERINA: https://arxiv.org/abs/2505.23135]",
    "Proof generation is catastrophically harder than code generation for LLMs: on VERINA, OpenAI o3 achieves 72.6% code correctness but only 4.9% proof success (one trial per task), a ~15\u00d7 gap. FIMO similarly found GPT-4 unable to close IMO-level proofs. This asymmetry suggests the formal deductive step \u2014 not natural-language understanding or code synthesis \u2014 is the dominant bottleneck. [VERINA: https://arxiv.org/abs/2505.23135; FIMO: https://arxiv.org/abs/2309.04295]",
    "Autoformalization (translating natural-language math into formal statements) has matured as a sub-task: ProofNet's prompt-retrieval and distilled backtranslation methods improved statement formalization, and VERINA explicitly separates specification generation (52.3% for o3) from proof generation (4.9%), confirming that writing a formal specification is far more tractable than constructing a verified proof. [ProofNet: https://arxiv.org/abs/2302.12433; VERINA: https://arxiv.org/abs/2505.23135]"
  ],
  "limitation": "StepFun-Prover Preview is the only industry-built prover in this route, but its research page is a JavaScript SPA that yields no body content via curl \u2014 only a meta description ('large language model designed for formal theorem proving through tool-integrated reasoning') was accessible. As a result, no benchmark numbers, architectural details, or comparisons to ProofNet/FIMO/VERINA tasks could be verified from the source, leaving the claim about tool-integrated reasoning as proposal-level rather than empirically supported within this trace.",
  "blocking_model_calls_needed": 6,
  "notes": "blocking_model_calls_needed = 1 (planning/question scoping) + 4 (one summary per paper) + 1 (final synthesis) = 6. StepFun-Prover page returned only meta description due to client-side rendering; content recorded as content-limited but not unavailable (HTTP 200). All arXiv abstracts returned full text via server-side rendered HTML. VERINA was submitted May 2025, last revised March 2026 (v3), making it the most recent source. No Lacuna pages, /md routes, or Lacuna-generated summaries were used."
}
```
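
Two pieces of arithmetic in the baseline output can be checked directly: the breakdown of `blocking_model_calls_needed` given in `notes`, and the "~15×" gap between o3's code and proof scores on VERINA. The variable names below are illustrative assumptions; the numbers are the ones quoted in the JSON above.

```python
# Breakdown stated in the "notes" field of the baseline output.
planning_calls = 1      # planning / question scoping
papers_summarized = 4   # one summary call per paper
synthesis_calls = 1     # final synthesis
blocking_calls = planning_calls + papers_summarized + synthesis_calls

# VERINA figures for o3 as quoted in the observations.
codegen_pct = 72.6
proofgen_pct = 4.9
gap = codegen_pct / proofgen_pct

print(f"blocking calls: {blocking_calls}")
print(f"code-to-proof gap: {gap:.1f}x")
```

The sum matches the reported value of 6, and the ratio comes out a little under 15, consistent with the rounded figure in the observations.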
