# Claude Code Workflow Evaluation

Run directory: `outputs/2026-05-08_pdf10_notes_baseline`

This is an operational trace of two automated runs, not a human-subjects study. Both conditions ran Claude Code in non-interactive mode with the same model and a fixed task.

## Summary Table

| Condition | Wall time | Claude turns | Tool calls | Tool text | Output tokens | Estimated blocking model calls | Cost | Outcome (final question, truncated) |
|---|---:|---:|---:|---:|---:|---:|---:|---|
| `lacuna_navigation` | 85.5s | 8 | 7 | 41.5KB | 4919 | 1 | $0.171 | At what granularity of compiler feedback — tactic-block, subgoal, or token-level — does iterative LLM-REPL interactio... |
| `pdf_to_chat_baseline` | 289.2s | 28 | 27 | 167.3KB | 14135 | 6 | $0.695 | Does integrating real-time formal verifier feedback into the LLM training loop via reinforcement learning—rather than... |
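
The two rows are easier to compare as ratios. The sketch below is an illustration only: the figures are copied from the table, while the dictionary layout, the key names, and the `ratios` helper are assumptions made for this example and are not part of the evaluation harness.

```python
# Figures copied from the summary table; key names are hypothetical.
runs = {
    "lacuna_navigation": {
        "wall_s": 85.5, "turns": 8, "tool_calls": 7, "tool_text_kb": 41.5,
        "output_tokens": 4919, "blocking_calls": 1, "cost_usd": 0.171,
    },
    "pdf_to_chat_baseline": {
        "wall_s": 289.2, "turns": 28, "tool_calls": 27, "tool_text_kb": 167.3,
        "output_tokens": 14135, "blocking_calls": 6, "cost_usd": 0.695,
    },
}


def ratios(baseline: dict, candidate: dict) -> dict:
    """Return baseline/candidate for every metric, rounded to two decimals."""
    return {key: round(baseline[key] / candidate[key], 2) for key in baseline}


r = ratios(runs["pdf_to_chat_baseline"], runs["lacuna_navigation"])
for metric, value in r.items():
    print(f"{metric}: baseline is {value}x the navigation run")
```

On these figures the baseline comes out at roughly 3.4x the wall time and 4.1x the cost of the navigation run. With one run per condition, the ratios describe this trace only and carry no variance estimate.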

## Tool Commands

### lacuna_navigation

- `curl -sS "http://34.8.208.118/md/render/direction/the-autoformalization-gap-in-theorem-proving-23293" | head -c 30000`
- `curl -sS "http://34.8.208.118/md/render/direction/iterative-compiler-feedback-for-formal-theorem-proving-12169" | head -c 30000`
- `curl -sS "http://34.8.208.118/md/render/paper/proofnet-autoformalizing-and-formally-proving-undergraduate-level-mathematics/art_f5a5f3551f6641598e578328d5771b3b" | head -c 30000`
- `curl -sS "http://34.8.208.118/md/render/paper/fimo-a-challenge-formal-dataset-for-automated-theorem-proving/art_d65417c966cc4525ab8bd63248394361" | head -c 30000`
- `curl -sS "http://34.8.208.118/md/render/paper/stepfun-prover-preview-let-s-think-and-verify-step-by-step/art_804f191b837940a1a2b73568d30d29b0" | head -c 30000`
- `curl -sS "http://34.8.208.118/md/render/paper/verina-benchmarking-verifiable-code-generation/art_ae1f0fb500a44279ba999f4c6df4ed6e" | head -c 30000`
- `curl -sS "http://34.8.208.118/md/render/hypothesis/token-level-alignment-of-informal-mathematics-to-formal-compiler-states-5f6a6e04c0bdbac4" | head -c 30000`
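
Every fetch above is capped with `head -c 30000`. A rough Python equivalent of that cap is sketched below, mainly to show how a run could check whether a page hit the limit. The function names `read_capped`, `fetch_capped`, and `hit_cap` are hypothetical; the run itself used `curl`, not this code.

```python
import io
from urllib.request import urlopen


def read_capped(stream, limit: int = 30000, chunk_size: int = 8192) -> bytes:
    """Read at most `limit` bytes from a binary file-like object."""
    chunks, remaining = [], limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # end of stream before the cap was reached
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def fetch_capped(url: str, limit: int = 30000, timeout: float = 30.0) -> bytes:
    """Fetch a URL and keep only the first `limit` bytes of the body."""
    with urlopen(url, timeout=timeout) as response:
        return read_capped(response, limit)


def hit_cap(body: bytes, limit: int = 30000) -> bool:
    """True when the body filled the cap, so content may have been cut off."""
    return len(body) >= limit


# Offline usage example with in-memory data instead of a network call.
short_page = read_capped(io.BytesIO(b"x" * 1200))
long_page = read_capped(io.BytesIO(b"x" * 50000))
print(len(short_page), hit_cap(short_page))
print(len(long_page), hit_cap(long_page))
```

A body that exactly fills the cap is flagged as possibly truncated, which is the conservative choice when the true page length is unknown.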

### pdf_to_chat_baseline

- `curl -L -sS -o /private/tmp/lacuna_workflow_eval/proofnet.pdf https://arxiv.org/pdf/2302.12433 && echo "Downloaded proofnet.pdf"`
- `pdftotext -f 1 -l 10 -layout /private/tmp/lacuna_workflow_eval/proofnet.pdf - | head -c 50000`
- `pdftotext -f 1 -l 10 -layout /private/tmp/lacuna_workflow_eval/proofnet.pdf - | head -c 50000 > /private/tmp/lacuna_workflow_eval/proofnet_text.txt && wc -c /private/tmp/lacuna_workflow_eval/proofnet_text.txt`
- `cat /private/tmp/lacuna_workflow_eval/proofnet_text.txt`
- `sed -n '1,200p' /private/tmp/lacuna_workflow_eval/proofnet_text.txt`
- `sed -n '200,400p' /private/tmp/lacuna_workflow_eval/proofnet_text.txt`
- `cat > /private/tmp/lacuna_workflow_eval/proofnet_note.md << 'EOF'
<omitted 4985 chars>`
- `cat >> /private/tmp/lacuna_workflow_eval/proofnet_note.md << 'EOF'
<omitted 649 chars>`
- `curl -L -sS -o /private/tmp/lacuna_workflow_eval/fimo.pdf https://arxiv.org/pdf/2309.04295 && echo "Downloaded fimo.pdf"`
- `pdftotext -f 1 -l 10 -layout /private/tmp/lacuna_workflow_eval/fimo.pdf - | head -c 50000`
- `pdftotext -f 1 -l 10 -layout /private/tmp/lacuna_workflow_eval/fimo.pdf - | head -c 50000 > /private/tmp/lacuna_workflow_eval/fimo_text.txt && sed -n '1,350p' /private/tmp/lacuna_workflow_eval/fimo_text.txt`
- `sed -n '1,150p' /private/tmp/lacuna_workflow_eval/fimo_text.txt 2>/dev/null || head -150 /private/tmp/lacuna_workflow_eval/fimo_text.txt`
- `sed -n '150,350p' /private/tmp/lacuna_workflow_eval/fimo_text.txt`
- `sed -n '350,500p' /private/tmp/lacuna_workflow_eval/fimo_text.txt`
- `cat > /private/tmp/lacuna_workflow_eval/fimo_note.md << 'EOF'
<omitted 5619 chars>`
- `curl -L -sS "https://stepfun.ai/research/en/stepfun-prover-preview" | head -c 50000`
- `curl -L -sS "https://stepfun.ai/research/en/stepfun-prover-preview" | python3 -c "
import sys, re, html
content = sys.stdin.read()
# Remove script and style tags
content = re.sub(r'<script[^>]*>.*?</script>', '', content, flags=re.DOTALL)
content = re.sub(r'<style[^>]*>.*?</style>', '', content, flags=re.DOTALL)
# Extract text from HTML
text = re.sub(r'<[^>]+>', ' ', content)
text = html.unescape(text)
text = re.sub(r'\s+', ' ', text).strip()
print(text[:20000])
"`
- `cat > /private/tmp/lacuna_workflow_eval/stepfun_note.md << 'EOF'
<omitted 6149 chars>`
- `curl -L -sS -o /private/tmp/lacuna_workflow_eval/verina.pdf https://arxiv.org/pdf/2505.23135 && echo "Downloaded verina.pdf"`
- `pdftotext -f 1 -l 10 -layout /private/tmp/lacuna_workflow_eval/verina.pdf - | head -c 50000`
- `pdftotext -f 1 -l 10 -layout /private/tmp/lacuna_workflow_eval/verina.pdf - | head -c 50000 > /private/tmp/lacuna_workflow_eval/verina_text.txt && sed -n '1,350p' /private/tmp/lacuna_workflow_eval/verina_text.txt`
- `sed -n '350,550p' /private/tmp/lacuna_workflow_eval/verina_text.txt`
- `sed -n '550,720p' /private/tmp/lacuna_workflow_eval/verina_text.txt`
- `sed -n '720,850p' /private/tmp/lacuna_workflow_eval/verina_text.txt`
- `wc -l /private/tmp/lacuna_workflow_eval/verina_text.txt && tail -100 /private/tmp/lacuna_workflow_eval/verina_text.txt`
- `cat > /private/tmp/lacuna_workflow_eval/verina_note.md << 'EOF'
<omitted 6450 chars>`
- `ls -la /private/tmp/lacuna_workflow_eval/*.md /private/tmp/lacuna_workflow_eval/*.pdf 2>/dev/null`
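
The inline Python used on the StepFun page can be read as one small function. The sketch below restates the same regex steps from the logged command; the name `strip_html` and the `max_chars` parameter are introduced here for illustration. Regex-based tag stripping is a rough heuristic, not an HTML parser, and it can mishandle malformed markup.

```python
import html
import re


def strip_html(content: str, max_chars: int = 20000) -> str:
    """Reduce an HTML document to whitespace-normalized plain text."""
    # Drop script and style elements together with their contents.
    content = re.sub(r"<script[^>]*>.*?</script>", "", content, flags=re.DOTALL)
    content = re.sub(r"<style[^>]*>.*?</style>", "", content, flags=re.DOTALL)
    # Replace every remaining tag with a space, then decode entities.
    text = re.sub(r"<[^>]+>", " ", content)
    text = html.unescape(text)
    # Collapse runs of whitespace and cap the result length.
    text = re.sub(r"\s+", " ", text).strip()
    return text[:max_chars]


sample = "<p>Hello &amp; <b>world</b></p><script>var x = 1;</script>"
print(strip_html(sample))
```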

## Final Outputs

### lacuna_navigation

```json
{
  "condition": "lacuna_navigation",
  "route": [
    {
      "url": "http://34.8.208.118/md/render/direction/the-autoformalization-gap-in-theorem-proving-23293",
      "reason": "Entry point: establishes the core autoformalization problem \u2014 LLMs have strategic intuition but fail formal compiler precision; introduced key concepts of granularity mismatch, neuro-symbolic bridges, and semantic misalignment."
    },
    {
      "url": "http://34.8.208.118/md/render/direction/iterative-compiler-feedback-for-formal-theorem-proving-12169",
      "reason": "Second direction page: details the REPL-based closed-loop feedback paradigm as the dominant response to one-shot fragility; introduces the 'semantic wall' limitation where LLMs thrash rather than converge on hard problems."
    },
    {
      "url": "http://34.8.208.118/md/render/paper/proofnet-autoformalizing-and-formally-proving-undergraduate-level-mathematics/art_f5a5f3551f6641598e578328d5771b3b",
      "reason": "Empirical grounding: ProofNet benchmark (371 undergraduate-level problems); Codex baseline formalization 13.4%, retrieval-augmented 16.1%; documented implicit-hypothesis failure mode (inner product space example)."
    },
    {
      "url": "http://34.8.208.118/md/render/paper/fimo-a-challenge-formal-dataset-for-automated-theorem-proving/art_d65417c966cc4525ab8bd63248394361",
      "reason": "Difficulty ceiling: 149 IMO-Shortlist problems in Lean; GPT-4 achieves 0% proof success; iterative feedback pipeline raises formalization success from 32.6% to 60.8%, establishing the gap between formalization and proof generation."
    },
    {
      "url": "http://34.8.208.118/md/render/paper/stepfun-prover-preview-let-s-think-and-verify-step-by-step/art_804f191b837940a1a2b73568d30d29b0",
      "reason": "State-of-the-art result: StepFun-Prover-32B achieves 70.0% pass@1 on miniF2F via GRPO RL with Lean 4 REPL tool integration, outperforming 671B models \u2014 demonstrates feedback quality dominates parameter count."
    },
    {
      "url": "http://34.8.208.118/md/render/paper/verina-benchmarking-verifiable-code-generation/art_ae1f0fb500a44279ba999f4c6df4ed6e",
      "reason": "Critical bottleneck quantification: o3 achieves 72.6% CodeGen but only 4.9% ProofGen; end-to-end (code+spec+proof) success is 3.2%; 64-round iterative refinement only reaches ~20% ProofGen at massive compute cost."
    },
    {
      "url": "http://34.8.208.118/md/render/hypothesis/token-level-alignment-of-informal-mathematics-to-formal-compiler-states-5f6a6e04c0bdbac4",
      "reason": "Proposed solution: token-to-compiler-state alignment using CTC-inspired contrastive loss and Dynamic Time Warping to address credit assignment in iterative proof search; connects speech alignment methods to theorem proving."
    }
  ],
  "pages_read_count": 7,
  "branch_decisions": [
    "Followed the mandatory required route without deviation \u2014 no branching was available or needed.",
    "Fetched pages in two parallel batches (2 then 5) to minimize wall-clock time while respecting sequential data availability.",
    "Used head -c 30000 on all fetches to bound output size; all pages returned well under the limit so no content was truncated in a loss-critical way."
  ],
  "source_checks": [
    {
      "claim": "Codex achieves 13.4% baseline formalization accuracy on ProofNet, rising to 16.1% with prompt retrieval",
      "checked_against": "http://34.8.208.118/md/render/paper/proofnet-autoformalizing-and-formally-proving-undergraduate-level-mathematics/art_f5a5f3551f6641598e578328d5771b3b \u2014 Key Results section",
      "status": "supported"
    },
    {
      "claim": "GPT-4 achieves 0% proof success on FIMO IMO-level problems even when provided the informal proof as a guide",
      "checked_against": "http://34.8.208.118/md/render/paper/fimo-a-challenge-formal-dataset-for-automated-theorem-proving/art_d65417c966cc4525ab8bd63248394361 \u2014 Key Insights section",
      "status": "supported"
    },
    {
      "claim": "StepFun-Prover-Preview-32B achieves 70.0% pass@1 on miniF2F, outperforming DeepSeek-Prover-V2-671B at 61.9%",
      "checked_against": "http://34.8.208.118/md/render/paper/stepfun-prover-preview-let-s-think-and-verify-step-by-step/art_804f191b837940a1a2b73568d30d29b0 \u2014 Experimental Results section",
      "status": "supported"
    },
    {
      "claim": "o3 achieves 72.6% CodeGen but only 4.9% ProofGen; end-to-end success drops to 3.2%; 64-round refinement reaches ~20% ProofGen",
      "checked_against": "http://34.8.208.118/md/render/paper/verina-benchmarking-verifiable-code-generation/art_ae1f0fb500a44279ba999f4c6df4ed6e \u2014 Key Insights section",
      "status": "supported"
    },
    {
      "claim": "Token-level alignment via DTW and contrastive loss over compiler state deltas can solve the credit assignment problem in iterative proof search",
      "checked_against": "http://34.8.208.118/md/render/hypothesis/token-level-alignment-of-informal-mathematics-to-formal-compiler-states-5f6a6e04c0bdbac4 \u2014 Core Claim and Proposed Experiments",
      "status": "proposal_level"
    },
    {
      "claim": "FIMO iterative-feedback pipeline raises formalization success from 32.6% to 60.8%",
      "checked_against": "http://34.8.208.118/md/render/paper/fimo-a-challenge-formal-dataset-for-automated-theorem-proving/art_d65417c966cc4525ab8bd63248394361 \u2014 Methodology section",
      "status": "supported"
    },
    {
      "claim": "Iterative refinement causes thrashing rather than convergence when the model lacks underlying deductive capability",
      "checked_against": "http://34.8.208.118/md/render/direction/iterative-compiler-feedback-for-formal-theorem-proving-12169 \u2014 The Semantic Wall section; corroborated by VERINA paper",
      "status": "limitation"
    }
  ],
  "final_question": "At what granularity of compiler feedback \u2014 tactic-block, subgoal, or token-level \u2014 does iterative LLM-REPL interaction most efficiently separate fixable translation errors from fundamental deductive gaps, and can token-to-compiler-state alignment (e.g., via DTW and contrastive loss on state deltas) measurably reduce thrashing on IMO-level benchmarks like FIMO?",
  "observations": [
    "Formalization and proof generation are sharply separable tasks with different difficulty ceilings: FIMO shows iterative LLM feedback raises formalization success from 32.6% to 60.8%, yet GPT-4 still achieves 0% proof success on the same IMO problems, and VERINA confirms this asymmetry at scale \u2014 o3 scores 72.6% on CodeGen but only 4.9% on ProofGen (sources: FIMO paper /md; VERINA paper /md).",
    "Tool-integrated RL with REPL feedback dominates raw parameter count: StepFun-Prover-Preview-32B achieves 70.0% pass@1 on miniF2F by training on live Lean 4 error trajectories via GRPO, outperforming DeepSeek-Prover-V2-671B (61.9%), suggesting that the quality of feedback integration \u2014 not model scale \u2014 is the current performance frontier (source: StepFun-Prover paper /md).",
    "Implicit mathematical hypotheses are a structural failure mode that compiler feedback cannot resolve: ProofNet documents that models consistently fail to infer unstated prerequisites required by formal libraries (e.g., that an orthogonal complement presupposes an inner product space), because these assumptions are present in human shared context but absent from both the problem prompt and the compiler's error messages (source: ProofNet paper /md; autoformalization gap direction /md)."
  ],
  "limitation": "Iterative compiler feedback exhibits diminishing returns and eventual thrashing on problems that exceed the model's deductive capability: VERINA shows that 64 rounds of Lean feedback raise ProofGen success only from ~5% to ~20% at massive token and inference cost, while the iterative compiler feedback direction page frames this as a 'semantic wall' \u2014 the compiler's binary error signals lack the semantic content needed to guide logical recovery, causing models to cycle through different incorrect proofs rather than converging. This means the technique is largely a surface-error fixer and cannot substitute for missing mathematical reasoning depth.",
  "blocking_model_calls_needed": 1,
  "notes": "All 7 required pages were reachable and returned full markdown content. Pages were fetched in two parallel Bash batches (2 + 5) within a single Claude Code session. No external search engines, PDFs, or non-Lacuna sources were consulted. The token-level alignment hypothesis page is categorized as proposal-level: it describes a proposed method with planned experiments, not yet empirically validated results. VERINA is an ICLR 2026 paper, consistent with the current date of 2026-05-08."
}
```

### pdf_to_chat_baseline

```json
{
  "condition": "pdf_to_chat_baseline",
  "route": [
    {
      "url": "https://arxiv.org/pdf/2302.12433",
      "reason": "ProofNet establishes the benchmark landscape for autoformalization and formal proving at undergraduate-mathematics level, revealing that parallel data scarcity is the foundational bottleneck (Codex 13.4% accuracy) and motivating the research question about what training-signal mechanisms can close this gap."
    },
    {
      "url": "https://arxiv.org/pdf/2309.04295",
      "reason": "FIMO shows that IMO-shortlisted problems define a qualitative difficulty cliff: GPT-4 achieves 0% Pass@8 on all 149 problems despite 21.3% on miniF2F, demonstrating that raw LLM capability without richer verifier interaction cannot bridge the gap to hard mathematical reasoning."
    },
    {
      "url": "https://stepfun.ai/research/en/stepfun-prover-preview",
      "reason": "StepFun-Prover-Preview provides the current SOTA result (70% pass@1 on miniF2F-test with a 32B model, surpassing a 671B competitor) and makes the causal argument: free-form, iterative Lean REPL interaction during both training (tool-integrated RL via GRPO) and inference is the key architectural lever, not model scale."
    },
    {
      "url": "https://arxiv.org/pdf/2505.23135",
      "reason": "VERINA extends the proof-generation bottleneck finding to software verification: even the best general-purpose LLM (o3) achieves only 4.9% proof success on structured coding tasks, and iterative refinement with Lean feedback raises this to only 20.1% at 64 steps\u2014confirming the bottleneck generalizes beyond mathematical benchmarks."
    }
  ],
  "pages_read_count": 4,
  "notes_written": [
    {
      "paper": "ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics",
      "file": "proofnet_note.md",
      "approx_words": 713,
      "decision": "use"
    },
    {
      "paper": "FIMO: A Challenge Formal Dataset for Automated Theorem Proving",
      "file": "fimo_note.md",
      "approx_words": 763,
      "decision": "use"
    },
    {
      "paper": "StepFun-Prover Preview: Let's Think and Verify Step by Step",
      "file": "stepfun_note.md",
      "approx_words": 798,
      "decision": "use"
    },
    {
      "paper": "VERINA: Benchmarking Verifiable Code Generation",
      "file": "verina_note.md",
      "approx_words": 853,
      "decision": "use"
    }
  ],
  "source_checks": [
    {
      "claim": "Code-davinci-002 achieves 13.4% autoformalization accuracy on ProofNet; prompt retrieval raises it to 16.1% with a +21.5 percentage-point typecheck rate improvement",
      "checked_against": "proofnet_note.md / Table 3 in arXiv:2302.12433",
      "status": "supported"
    },
    {
      "claim": "GPT-4 achieves 0% Pass@1 and 0% Pass@8 on all 149 FIMO problems, compared to 21.3% Pass@8 on miniF2F validation set",
      "checked_against": "fimo_note.md / Table 9 in arXiv:2309.04295",
      "status": "supported"
    },
    {
      "claim": "StepFun-Prover-Preview-32B achieves 70.0% pass@1 on miniF2F-test, surpassing Kimina-Prover-72B (63.9%) and DeepSeek-Prover-V2-671B (61.9%) with a smaller model",
      "checked_against": "stepfun_note.md / Table 1 at stepfun.ai/research/en/stepfun-prover-preview",
      "status": "supported"
    },
    {
      "claim": "OpenAI o3 achieves 72.6% code correctness, 52.3% spec soundness+completeness, but only 4.9% proof success on VERINA; iterative refinement with Lean feedback reaches 20.1% at 64 steps",
      "checked_against": "verina_note.md / Figures 5\u20136 in arXiv:2505.23135",
      "status": "supported"
    },
    {
      "claim": "Tool-integrated RL with free-form Lean REPL interaction outperforms systems that allow only a single retry of verifier feedback",
      "checked_against": "stepfun_note.md / Introduction and Methodology sections at stepfun.ai",
      "status": "supported"
    },
    {
      "claim": "Proof generation is the bottleneck for end-to-end verifiable code generation: o3 achieves 41.2% code+spec jointly but only ~3.2% end-to-end with proof",
      "checked_against": "verina_note.md / Figure 6 in arXiv:2505.23135",
      "status": "supported"
    }
  ],
  "final_question": "Does integrating real-time formal verifier feedback into the LLM training loop via reinforcement learning\u2014rather than scaling model parameters or enlarging parallel informal-formal corpora\u2014represent the primary lever for advancing neural theorem proving systems across the full difficulty spectrum from undergraduate mathematics (ProofNet) to IMO-level problems (FIMO) and software verification (VERINA)?",
  "observations": [
    "Verifier-feedback integration during training, not model scale, drives the largest performance gains: StepFun-Prover-Preview-32B achieves 70% pass@1 on miniF2F by allowing the model to freely interact with the Lean 4 REPL during rollout and training via GRPO, surpassing DeepSeek-Prover-V2-671B (61.9%)\u2014a model more than 20\u00d7 larger\u2014by nearly 8 percentage points (source: stepfun_note.md / StepFun-Prover article, Table 1).",
    "A hard difficulty cliff separates the competition-math regime from research-level formal proving: GPT-4 achieves 21.3% Pass@8 on miniF2F but 0% Pass@8 on all 149 FIMO IMO-shortlisted problems, and ProofNet's Codex baseline of 13.4% on undergraduate mathematics similarly does not transfer\u2014indicating that benchmark-specific saturation in miniF2F masks genuine inability to generalize to harder formal reasoning tasks (source: fimo_note.md / FIMO Table 9; proofnet_note.md / ProofNet Table 3).",
    "Proof generation is the decisive bottleneck across both mathematical and software verification domains: VERINA shows that o3\u2014which achieves 72.6% code correctness\u2014falls to 4.9% proof success on structured programming verification tasks, and even with 64 Lean-guided refinement steps only reaches 20.1%, while the best specialized theorem-proving model (Goedel Prover V2 32B) reaches only 11.2%\u2014suggesting the formal proof generation problem is a domain-general failure mode, not specific to mathematical abstraction (source: verina_note.md / VERINA Figures 5\u20136)."
  ],
  "limitation": "The miniF2F benchmark where tool-integrated RL systems like StepFun-Prover achieve 70% pass@1 may be approaching saturation and reflects a narrow distribution of competition-math statement types; the same class of systems scores 0% on FIMO's IMO-shortlisted problems and under 12% on VERINA's software verification proofs, indicating that verifier-feedback training induces benchmark-specific rather than generalizable formal reasoning. The scarcity of high-quality parallel informal-formal data (flagged by ProofNet as a fundamental constraint) compounds this: without diverse hard proof examples to train on, RL-based systems cannot self-improve beyond the difficulty frontier of their training distribution.",
  "blocking_model_calls_needed": 6,
  "notes": "All four sources were successfully fetched and processed. The StepFun article is a Next.js server-rendered page; text was extracted by stripping HTML tags from the raw response\u2014the main article body was fully recoverable (~4,000 words). FIMO and ProofNet PDFs were extracted via pdftotext on pages 1\u201310 (approx. 40\u201349KB each). VERINA PDF is 10 pages at ICLR 2026 camera-ready density; the results section was cut off just after Figure 6 but the abstract and key quantitative results (code 72.6%, spec 52.3%, proof 4.9%, end-to-end 3.2%) were captured in the available text. No sources were unavailable. The blocking_model_calls_needed estimate of 6 reflects: 1 planning call + 4 sequential paper-summary calls (one per source, each requiring a full model response to process extracted text) + 1 final synthesis call."
}
```
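
The two final outputs share the fields `pages_read_count`, `blocking_model_calls_needed`, and `source_checks`, so they can be compared mechanically. The sketch below tallies the `status` values per condition. The `summarize` helper is hypothetical, and the dictionaries are abbreviated copies of the JSON above that keep only the fields used here.

```python
from collections import Counter


def summarize(output: dict) -> dict:
    """Collect the comparable fields of one final output."""
    statuses = Counter(check["status"] for check in output.get("source_checks", []))
    return {
        "condition": output["condition"],
        "pages_read": output["pages_read_count"],
        "blocking_calls": output["blocking_model_calls_needed"],
        "statuses": dict(statuses),
    }


# Abbreviated copies of the two outputs; statuses are listed in document order.
lacuna = {
    "condition": "lacuna_navigation",
    "pages_read_count": 7,
    "blocking_model_calls_needed": 1,
    "source_checks": [
        {"status": s}
        for s in ["supported"] * 4 + ["proposal_level", "supported", "limitation"]
    ],
}
baseline = {
    "condition": "pdf_to_chat_baseline",
    "pages_read_count": 4,
    "blocking_model_calls_needed": 6,
    "source_checks": [{"status": "supported"}] * 6,
}

for run in (lacuna, baseline):
    print(summarize(run))
```

The tally shows that the navigation run separates proposal-level and limitation claims from supported ones, while the baseline marks all six of its checks as supported.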
