# pd-runner

Python orchestration layer for the Lean engine in `../engine`. Provides:

1. A CLI that runs program-vs-program matchups by generating Lean snippets
   and invoking `lake env lean` to discover the resulting actions.
2. An agentic proof-writing pipeline that uses an LLM to author Lean 4
   outcome proofs for bot pairs, with retrieval, tool use, and a verified
   write-back into the library.
3. A FastAPI server with a minimal web UI that drives the end-to-end
   natural-language → bot → verified-outcome pipeline.

## Quick Start (uv)

```bash
cd app
uv sync
uv run run-matchup --left CooperateBot --right DefectBot
```

List bots discovered from the Lean engine:

```bash
uv run run-matchup --list-bots
```

### CLI parameters

- `--left BOT` / `--right BOT`: bot names, e.g. `CooperateBot`, `CupodBot:3`,
  `CupodBot:?`.
- `--list-bots`: print the bots discovered from the Lean engine and exit.
- `--claim-left C|D` and `--claim-right C|D`: check a claimed action pair.
- `--quiet`: print only the action pair.
- `--no-keep-file`: delete the generated Lean file after execution.
- `--json`: print the result as JSON.
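
For scripting, the flags above can be assembled programmatically. The following is a minimal sketch and not part of pd-runner: `build_matchup_argv` and `run_matchup` are hypothetical helper names, and because the shape of the `--json` output is not documented here, the sketch returns raw stdout and leaves parsing to the caller.

```python
from __future__ import annotations

import subprocess


def build_matchup_argv(
    left: str,
    right: str,
    *,
    claim: tuple[str, str] | None = None,
    as_json: bool = False,
    keep_file: bool = True,
) -> list[str]:
    """Assemble a `uv run run-matchup` command line from the documented flags."""
    argv = ["uv", "run", "run-matchup", "--left", left, "--right", right]
    if claim is not None:
        for action in claim:
            if action not in ("C", "D"):
                raise ValueError(f"claimed action must be 'C' or 'D', got {action!r}")
        argv += ["--claim-left", claim[0], "--claim-right", claim[1]]
    if as_json:
        argv.append("--json")
    if not keep_file:
        argv.append("--no-keep-file")
    return argv


def run_matchup(argv: list[str]) -> str:
    """Run the command and return raw stdout (needs uv and the Lean engine)."""
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout
```

Because the arguments are passed as a list, no shell is involved, so a spec such as `CupodBot:?` needs no quoting when launched this way.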

### Parameterized bots

- `Bot:k` — concrete natural-number parameter, e.g. `CupodBot:3`.
- `Bot:?` — first tries universal-parameter theorems (`kind: all_parameters`,
  `witness: any`); falls back to existential (`kind: exists_parameter`,
  `witness: unknown`).
- Existential theorems do not certify concrete parameters: `CupodBot:?` can
  use an `∃ k` theorem, but `CupodBot:3` requires a theorem about that
  concrete `k`.
- Quote `Bot:?` in zsh/bash so `?` is not treated as a glob.
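
The spec forms and the fallback order above can be sketched as follows. This is an illustration under assumptions, not the runner's actual code: `BotSpec`, `parse_bot_spec`, and `pick_theorem_kind` are hypothetical names, and only the `kind` and `witness` values are taken from the list above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BotSpec:
    name: str
    param: int | None  # concrete k for `Bot:k`, otherwise None
    any_param: bool    # True only for `Bot:?`


def parse_bot_spec(spec: str) -> BotSpec:
    """Parse `Bot`, `Bot:k`, or `Bot:?` into a BotSpec."""
    name, sep, raw = spec.partition(":")
    if not name:
        raise ValueError(f"missing bot name in {spec!r}")
    if not sep:
        return BotSpec(name, None, False)
    if raw == "?":
        return BotSpec(name, None, True)
    if raw.isascii() and raw.isdigit():
        return BotSpec(name, int(raw), False)
    raise ValueError(f"parameter must be a natural number or '?', got {raw!r}")


def pick_theorem_kind(available_kinds: set[str]) -> tuple[str, str] | None:
    """For `Bot:?`: prefer a universal theorem, then fall back to an existential one.

    Returns a (kind, witness) pair, or None if neither kind is available.
    """
    if "all_parameters" in available_kinds:
        return ("all_parameters", "any")
    if "exists_parameter" in available_kinds:
        return ("exists_parameter", "unknown")
    return None
```

A concrete spec such as `CupodBot:3` would skip `pick_theorem_kind` entirely, since an existential theorem does not certify a specific `k`.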

### Useful uv commands

- `uv sync --extra dev` — install dev tools (`pytest`, `ruff`).
- `uv sync --extra api` — install the API stack (`fastapi`, `uvicorn`).
- `uv run pytest` — run the test suite.

### Notes

- This app assumes the Lean project lives at `../engine`.
- Generated Lean snippets are written to `generated/lean/`; build/eval logs
  to `generated/logs/`; per-attempt outcome artifacts to `generated/outcomes/`.
- Unknown-bot errors include the available bot list.
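
The directory layout above suggests a simple write-then-check flow. The sketch below is an assumption about how such a flow could look, not the runner's implementation: `write_snippet`, `lean_command`, and `check_snippet` are hypothetical names, and the snippet source is passed in rather than generated, since the real snippet format is not described here.

```python
from __future__ import annotations

import subprocess
from pathlib import Path

ENGINE_DIR = Path("../engine")        # Lean project location assumed by the app
SNIPPET_DIR = Path("generated/lean")  # where generated snippets are written


def write_snippet(name: str, source: str, out_dir: Path = SNIPPET_DIR) -> Path:
    """Write Lean source to `<out_dir>/<name>.lean`, creating the directory if needed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.lean"
    path.write_text(source, encoding="utf-8")
    return path


def lean_command(snippet: Path) -> list[str]:
    """Command that elaborates one file inside the Lake environment."""
    # An absolute path is used because the command runs from the engine directory.
    return ["lake", "env", "lean", str(snippet.resolve())]


def check_snippet(snippet: Path, engine_dir: Path = ENGINE_DIR) -> subprocess.CompletedProcess:
    """Run Lean on the snippet from the engine directory and capture its output."""
    return subprocess.run(
        lean_command(snippet), cwd=engine_dir, capture_output=True, text=True
    )
```

Running with `cwd=engine_dir` lets `lake` find the project's configuration, while the resolved snippet path keeps the file reachable from that directory.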

---

## LLM proof agent

The `llm/`, `eval/`, and `services/` modules implement an agentic
proof-writing pipeline. Given a bot pair, the agent automatically writes
and verifies a Lean 4 outcome proof.

### Architecture

```
ProofRequest (bot pair + outcome stub)
    │
    ├─ retrieval.py        — fetch relevant existing theorem files as few-shot context
    ├─ prompts.py          — build system prompt (embeds Program.lean + Dynamics.lean
    │                        verbatim; adds Axioms.lean for .search bots)
    │
    └─ proof_service.py    — agentic loop (LLM client + tool handler)
            │
            ├─ run_lean_proof tool    — write candidate to temp file, run lake env lean,
            │                           return any errors
            ├─ read_library_file tool — read any file under engine/PrisonersDilemma/
            │
            └─ ProofResult (verified Lean source + iteration count)
                    │
                    └─ library_writer.py — write to engine/PrisonersDilemma/Theorems/,
                                           verify with lake build, roll back on failure,
                                           optional human-acceptance gate
```
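
The core of the loop in the diagram (propose a candidate, check it, feed errors back, stop on success or when the iteration budget runs out) can be sketched in a few lines. This is a deliberately simplified stand-in: the real loop lets the LLM choose which tool to call, whereas here `propose` and `check` are plain callables supplied by the caller, and the `ProofResult` and `ProofSearchError` classes below are reduced versions defined only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class ProofResult:
    source: str      # Lean source that passed the check
    iterations: int  # number of propose/check rounds used


class ProofSearchError(RuntimeError):
    """Raised when no candidate passes within the iteration budget."""


def search_proof(
    propose: Callable[[str, str], str],
    check: Callable[[str], str],
    stub: str,
    max_iterations: int = 20,
) -> ProofResult:
    """Iterate propose -> check until the checker reports no errors.

    propose(stub, last_errors) returns a candidate proof;
    check(candidate) returns error text, or "" when the candidate is accepted.
    """
    errors = ""
    for iteration in range(1, max_iterations + 1):
        candidate = propose(stub, errors)
        errors = check(candidate)
        if not errors:
            return ProofResult(candidate, iteration)
    raise ProofSearchError(f"no verified proof after {max_iterations} iterations")


# Toy usage with fake components: the first attempt fails, the second succeeds.
def fake_propose(stub: str, errors: str) -> str:
    return stub + (" rfl" if errors else " sorry")


def fake_check(candidate: str) -> str:
    return "" if candidate.endswith("rfl") else "error: unsolved goals"


result = search_proof(fake_propose, fake_check, "theorem demo : 1 = 1 := by")
print(result.iterations)
```

In the pipeline described above, the role of `check` is played by the `run_lean_proof` tool, which returns Lean's errors to the model.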

### Module reference

| Module | Purpose |
|---|---|
| `llm/client.py` | Multi-turn tool-use loop with adaptive thinking and system-prompt caching. `ToolHandler` registry maps tool names to Python callables. |
| `llm/tools.py` | Tool schemas and implementations: `run_lean_proof` (per-iteration check, no full `lake build`) and `read_library_file`. |
| `llm/retrieval.py` | `retrieve_few_shots` — most relevant existing theorem files for a bot pair (name match then content match). `list_known_outcome_theorems` — scans the discovered theorem registry for already-proven outcomes involving the bots. |
| `llm/prompts.py` | `build_system_prompt` — embeds `Program.lean` and `Dynamics.lean` source; adds `Axioms.lean` when either bot uses `.search`. `proof_request_message` — per-request user message with the theorem stub, known theorems, and few-shot files. |
| `services/proof_service.py` | `search_proof(ProofRequest) -> ProofResult` — the agentic loop. Raises `ProofSearchError` on failure. |
| `services/library_writer.py` | `write_proof_to_library(ProofResult)` — writes a verified proof to `engine/PrisonersDilemma/Theorems/`, verifies it with `lake build`, and rolls back on failure. Never overwrites existing files. |
| `eval/harness.py` | Evaluation harness: re-proves held-out theorems and reports pass rate, iteration count, and wall time. |
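
The write, verify, and roll-back behaviour described for `services/library_writer.py` can be illustrated with a small sketch. It is an assumption about one way to get those guarantees, not the module's actual code: `write_with_rollback` is a hypothetical name, and the verification step is injected as a callable so the example runs without Lean installed.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable


def write_with_rollback(target: Path, source: str, verify: Callable[[], bool]) -> bool:
    """Create `target`, run `verify`, and delete `target` again if verification fails.

    Opening with mode "x" raises FileExistsError when the file already exists,
    so an existing file is never overwritten.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "x", encoding="utf-8") as handle:
        handle.write(source)
    try:
        ok = verify()
    except BaseException:
        # Roll back on any interruption, then let the exception propagate.
        target.unlink(missing_ok=True)
        raise
    if not ok:
        target.unlink(missing_ok=True)
    return ok
```

In real use, `verify` might be something like `lambda: subprocess.run(["lake", "build"], cwd=engine_dir).returncode == 0`, where `engine_dir` is the path to the Lean project.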

### Running the eval harness

Requires an LLM API key in the environment (see `.env.example`).

```bash
# Full eval run
uv run python -m pd_runner.eval.harness --output results.json

# Dry run — tests retrieval + prompt plumbing without any API calls
uv run python -m pd_runner.eval.harness --dry-run
```

Key options:

- `--model MODEL` — model ID (see `.env.example` for supported providers).
- `--max-iterations N` — max tool-use iterations per proof (default: 20).
- `--cases [N [M]]` — case selection (0-indexed): omit for all, one int for
  first N, two ints for slice `N:M`.
- `--log-level LEVEL` — `WARNING` (default, silent) / `INFO` (tool call
  summaries) / `DEBUG` (responses + Lean source + tool results) / `TRACE`
  (also dumps system prompt and user message). Logs go to stderr; redirect
  to keep the summary table clean: `... --log-level DEBUG 2>debug.log`.
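
The `--cases [N [M]]` semantics map directly onto Python slicing. The sketch below shows one way to express them with `argparse`; it is an illustration, not the harness's own parser, and `build_parser` and `select_cases` are hypothetical names.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="harness")
    # nargs="*" accepts zero, one, or two integers after --cases.
    parser.add_argument("--cases", nargs="*", type=int, default=None, metavar="N")
    parser.add_argument("--max-iterations", type=int, default=20)
    return parser


def select_cases(cases: list, selection: list[int] | None) -> list:
    """Apply the documented rules: none -> all, [N] -> first N, [N, M] -> cases[N:M]."""
    if not selection:
        return list(cases)
    if len(selection) == 1:
        return list(cases[: selection[0]])
    if len(selection) == 2:
        return list(cases[selection[0] : selection[1]])
    raise ValueError("--cases takes at most two integers")


args = build_parser().parse_args(["--cases", "1", "3"])
print(select_cases(["a", "b", "c", "d", "e"], args.cases))
```

With five cases, `--cases 1 3` selects the cases at indices 1 and 2, matching the 0-indexed slice `1:3`.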

### Eval case tiers

| Tier | Bots | Difficulty |
|---|---|---|
| 1 | CooperateBot, DefectBot | Trivial — `.const` action, `unfold` + `rfl` |
| 2 | MirrorBot, OBot, DBot | One-step simulation, `unfold` + `simp` |
| 3 | TitForTatBot, EBot | Multi-step, requires inspecting bot definitions |

---

## API server

```bash
uv run pd-serve --reload
```

Serves the natural-language → bot → verified-outcome pipeline at the root
URL. The pipeline is a two-step async job with human acceptance gates at
(1) the generated bot definitions and (2) the proven outcome theorem.
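
The two acceptance gates can be modelled as a small state machine. The sketch below is purely illustrative: the server's real job model and routes are not described here, so `Stage` and `PipelineJob` are hypothetical names, and the asynchronous generation and proving work between gates is omitted.

```python
from enum import Enum


class Stage(Enum):
    AWAITING_BOT_ACCEPTANCE = "awaiting_bot_acceptance"      # gate 1: bot definitions
    AWAITING_PROOF_ACCEPTANCE = "awaiting_proof_acceptance"  # gate 2: outcome theorem
    DONE = "done"
    REJECTED = "rejected"


class PipelineJob:
    """Tracks which human acceptance gate a job is waiting on."""

    def __init__(self) -> None:
        self.stage = Stage.AWAITING_BOT_ACCEPTANCE

    def decide(self, accept: bool) -> Stage:
        """Record a human decision at the current gate and advance the job."""
        if self.stage in (Stage.DONE, Stage.REJECTED):
            raise RuntimeError(f"job already finished: {self.stage.value}")
        if not accept:
            self.stage = Stage.REJECTED
        elif self.stage is Stage.AWAITING_BOT_ACCEPTANCE:
            self.stage = Stage.AWAITING_PROOF_ACCEPTANCE
        else:
            self.stage = Stage.DONE
        return self.stage
```

A job reaches `DONE` only after both gates are accepted in order; a rejection at either gate ends it.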
