# Agent Workflow Evaluation

This folder contains the operational Claude Code trace used for the workshop paper.

The current paper uses the corrected `outputs/2026-05-08_pdf10_notes_baseline/` run. In that run, Lacuna navigation uses served `/md` pages, while the PDF baseline reads source text and writes per-paper Lacuna-style markdown notes before moving to the next source. The earlier `outputs/2026-05-08/` and `outputs/2026-05-08_pdf10_mdnav/` runs are retained for comparison.

## Conditions

- `lacuna_navigation_prompt.md`: Claude Code uses Lacuna `/md` pages and source-linked Lacuna paper/proposal pages.
- `pdf_to_chat_baseline_prompt.md`: Claude Code may not use Lacuna pages; it must summarize a fixed set of source-paper pages sequentially before synthesizing a question.

Both conditions use the same task: starting from the seed phrase `automated theorem proving proof assistants`, produce one scoped research question, three source-linked observations, one limitation, and a route summary.
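The required answer shape can be checked mechanically. The following is a minimal sketch, not the check used for the paper: the key names (`research_question`, `observations`, `limitation`, `route_summary`, and the per-observation `text` and `source`) are hypothetical, and the actual final JSON outputs may use different keys.

```python
from typing import Any


def check_task_answer(answer: dict[str, Any]) -> list[str]:
    """Return a list of shape problems; an empty list means the answer fits.

    All key names here are hypothetical placeholders for illustration.
    """
    problems: list[str] = []

    # One research question, one limitation, one route summary:
    # each is expected to be a non-empty string.
    for key in ("research_question", "limitation", "route_summary"):
        value = answer.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key}: expected a non-empty string")

    # Exactly three observations, each carrying a link back to its source.
    observations = answer.get("observations")
    if not isinstance(observations, list) or len(observations) != 3:
        problems.append("observations: expected a list of exactly three items")
    else:
        for i, obs in enumerate(observations):
            if not isinstance(obs, dict) or not obs.get("text") or not obs.get("source"):
                problems.append(f"observations[{i}]: expected 'text' and 'source'")

    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to compare how far each condition's answer deviates from the requested shape.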

## Reproducing

From the repository root:

```bash
python3 manuscripts/ai4research_icml_2026_lacuna/supplement/agent_workflow_eval/run_claude_eval.py --model sonnet --max-budget-usd 1.00
```

This requires a working local Claude Code installation with network access. The script runs Claude Code in `/private/tmp/lacuna_workflow_eval`, captures stream JSON output, and writes summaries under `outputs/2026-05-08/`.

## Included Outputs

- `outputs/2026-05-08_pdf10_notes_baseline/results.md`: human-readable comparison table, tool commands, and final JSON outputs.
- `outputs/2026-05-08_pdf10_notes_baseline/*.summary.json`: machine-readable run summaries.

The raw Claude Code JSONL streams are kept out of the submission artifact because the `system:init` events include local machine paths and plugin metadata. The summarized outputs retain the measured timing, turn counts, tool commands, costs, and final task answers.
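For readers who regenerate the raw streams locally, the redaction described above can be approximated as follows. This is a sketch under stated assumptions, not the logic in `run_claude_eval.py`: it assumes each JSONL line is one event, that init events carry `type` set to `system` and `subtype` set to `init`, and that the final `result` event exposes `duration_ms`, `num_turns`, and `total_cost_usd`. The function names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def redact_stream(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed events from JSONL lines, skipping system:init events."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        event = json.loads(line)
        # Assumption: init events are tagged type="system", subtype="init".
        if event.get("type") == "system" and event.get("subtype") == "init":
            continue
        yield event


def summarize(events: Iterable[dict]) -> dict:
    """Collect timing, turn count, and cost from the last result event, if any."""
    summary: dict = {}
    for event in events:
        if event.get("type") == "result":
            # Assumed key names; adjust to match the actual stream.
            for key in ("duration_ms", "num_turns", "total_cost_usd"):
                if key in event:
                    summary[key] = event[key]
    return summary
```

Dropping the init events before any further processing keeps local paths and plugin metadata out of every derived file, not only the final summary.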
