# CKM Benchmark Leaderboard

Current as of: **v0.1 (initial release)**

All numbers are reproduced from `results/*_summary.json` using the canonical judge protocol described in [`EVALUATION.md`](EVALUATION.md).

## v0.1 — 50 topics

| Rank | System | Yield/topic | Hit rate (%) | Coverage | Tokens/topic | Wallclock/topic | Notes |
|------|--------|-------------|--------------|----------|--------------|-----------------|-------|
| 1 | **CKM-Lite** | **17.3** | **5.8** | **36/50 (72%)** | **0.51M** | 53 min | This work; incremental sliding-window state |
| 2 | Batch (one-shot) | 13.7 | 3.0 | 15/50 (30%) | 5.93M | 55 min | Naive baseline: dump all papers in one prompt |
| 3 | CKM-Full | 17.8 | 1.4 | 11/50 (22%) | 5.87M | 67 min | Instrumented variant: change-detection step |

### Statistical significance

- CKM-Lite vs Batch (hit rate): Wilcoxon signed-rank test on paired per-topic hit rates, $p=0.006$.
- CKM-Lite vs Batch (yield): $p<0.0001$.
- CKM-Lite vs Batch (best-match score): $p=0.0003$.
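For readers who want to rerun a paired comparison on their own per-topic scores, the following is a minimal standard-library sketch of a two-sided Wilcoxon signed-rank test. It uses the normal approximation with a tie correction and no continuity correction; whether the reported $p$-values come from an exact or an approximate variant is not stated here, so small numerical differences should be expected. The function name and its inputs are illustrative and not part of the benchmark code.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test on paired samples.

    Normal approximation with tie correction; zero differences are dropped.
    Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no non-zero differences: nothing to test

    # Rank |differences|, giving tied values the average of their ranks.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, p_value
```

Pass two equal-length lists ordered by topic, for example the per-topic hit rates of two systems. With only 50 topics and many ties, an exact test may be preferable to this approximation.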

### What the rankings tell us

- **CKM-Lite outperforms Batch** on every metric and is the system to beat. The win comes primarily from incremental sliding-window state plus structured per-hypothesis output.
- **CKM-Full has higher per-hit quality** (best-match score 4.58 vs 3.90 for Lite) but **lower coverage** (11/50 topics vs 36/50). This is the F1 failure mode reported in the paper: heavier instrumentation can concentrate hypotheses on a narrower, higher-quality set at the cost of broad coverage.
- **Batch is a real baseline**, not a strawman. It uses the same models and the same papers; only the orchestration differs. The 1.9× hit-rate gap (5.8% vs 3.0%) is therefore attributable to the workflow, not to model quality.
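The headline ratios can be checked directly from the v0.1 table. The short sketch below copies the table values by hand (it does not read `results/*_summary.json`, whose schema is not described here) and derives the coverage percentages, the hit-rate gap, and the token ratio. The dictionary keys are illustrative names.

```python
# Values copied by hand from the v0.1 leaderboard table.
N_TOPICS = 50
systems = {
    "CKM-Lite": {"hit_pct": 5.8, "covered": 36, "tokens_m": 0.51},
    "Batch": {"hit_pct": 3.0, "covered": 15, "tokens_m": 5.93},
    "CKM-Full": {"hit_pct": 1.4, "covered": 11, "tokens_m": 5.87},
}

def coverage_pct(name):
    """Share of topics covered, in percent."""
    return 100 * systems[name]["covered"] / N_TOPICS

def ratio(metric, a, b):
    """Value of `metric` for system `a` divided by that for system `b`."""
    return systems[a][metric] / systems[b][metric]

hit_gap = ratio("hit_pct", "CKM-Lite", "Batch")       # about 1.93
token_ratio = ratio("tokens_m", "Batch", "CKM-Lite")  # about 11.6

for name in systems:
    print(f"{name}: coverage {coverage_pct(name):.0f}%")
print(f"hit-rate gap (Lite/Batch): {hit_gap:.2f}x")
print(f"token ratio (Batch/Lite): {token_ratio:.1f}x")
```

On these numbers, CKM-Lite reaches its higher hit rate while using roughly one-twelfth of the tokens per topic that Batch does.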

## System notes

### CKM-Lite (this work)

- **Architecture**: 3-stage workflow (initialization, window update, hypothesis generation) over 6 sliding 2-month windows.
- **Generator**: Gemini-2.5-Flash.
- **Judge**: GPT-4o-mini → GPT-4o (canonical, two-stage).
- **Hypothesis emission**: structured Markdown with 7 fields (statement, problem, method delta, baseline, expected observable, evaluation plan, failure mode) plus source-paper citations and self-assessed novelty/feasibility/impact.
- **Reference run**: `lite_20260405_000653/`, 44.4 hours wallclock, ~\$130 total cost.
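As an illustration of the emission format, the seven fields plus citations and self-assessment can be modelled as a small record that renders to Markdown. This is a sketch under assumptions: the class name, attribute names, score types, and the exact Markdown layout are hypothetical, and the authoritative format is whatever `SUBMISSION_FORMAT.md` specifies.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# (attribute, label) pairs for the seven required fields, in document order.
SEVEN_FIELDS = [
    ("statement", "Statement"),
    ("problem", "Problem"),
    ("method_delta", "Method delta"),
    ("baseline", "Baseline"),
    ("expected_observable", "Expected observable"),
    ("evaluation_plan", "Evaluation plan"),
    ("failure_mode", "Failure mode"),
]

@dataclass
class Hypothesis:
    statement: str
    problem: str
    method_delta: str
    baseline: str
    expected_observable: str
    evaluation_plan: str
    failure_mode: str
    citations: List[str] = field(default_factory=list)  # source-paper IDs
    # Self-assessed scores; the scale is an assumption of this sketch.
    novelty: Optional[int] = None
    feasibility: Optional[int] = None
    impact: Optional[int] = None

    def to_markdown(self) -> str:
        """Render the record as a Markdown bullet list (layout is illustrative)."""
        lines = [f"- **{label}**: {getattr(self, attr)}" for attr, label in SEVEN_FIELDS]
        lines.append(f"- **Sources**: {', '.join(self.citations)}")
        lines.append(
            f"- **Self-assessment**: novelty={self.novelty}, "
            f"feasibility={self.feasibility}, impact={self.impact}"
        )
        return "\n".join(lines)
```

Keeping the fields explicit makes it easy to reject an emission that omits one of them before it reaches the judge.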

### Batch (one-shot baseline)

- **Architecture**: single prompt per topic that ingests all in-window papers (init + 6 evolution windows merged) and asks for hypotheses in one shot.
- **Generator**: Gemini-2.5-Flash (same as CKM-Lite).
- **Why included**: closest to what an LLM-using researcher would actually run today without any orchestration.
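The difference between the two orchestrations can be sketched as follows. Here `generate` is a stand-in for a single LLM call, and the prompt strings are placeholders rather than the benchmark's actual prompts. The sketch also assumes that the incremental workflow generates hypotheses once, after the last window; the real CKM-Lite may emit them at other points.

```python
from typing import Callable, List, Sequence

Generator = Callable[[str], str]  # prompt -> text; stands in for one LLM call

def run_batch(init_papers: Sequence[str],
              windows: Sequence[Sequence[str]],
              generate: Generator) -> str:
    """One-shot: merge the init set and all windows into a single prompt."""
    merged: List[str] = list(init_papers) + [p for w in windows for p in w]
    return generate("PAPERS:\n" + "\n".join(merged) + "\nPropose hypotheses.")

def run_incremental(init_papers: Sequence[str],
                    windows: Sequence[Sequence[str]],
                    generate: Generator) -> str:
    """Incremental: initialize state, update it once per window, then generate."""
    state = generate("INIT:\n" + "\n".join(init_papers))
    for window in windows:
        # Each update sees only the running state plus the new window's papers.
        state = generate("STATE:\n" + state + "\nNEW PAPERS:\n" + "\n".join(window))
    return generate("STATE:\n" + state + "\nPropose hypotheses.")
```

With six windows, the batch path makes one call over everything, while the incremental path makes eight smaller calls (one initialization, six updates, one generation) in this sketch.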

### CKM-Full (instrumented analytical probe)

- **Architecture**: CKM-Lite + an explicit change-detection step that classifies each window's update into 9 trigger types (Convergence, Bridge, Contradiction, Gap, Trend_Confirmed, etc.) and feeds the labels into the generation prompt.
- **Generator**: Gemini-2.5-Flash.
- **Why included**: not a deployment recommendation; included as the analytical lens for the F1 failure mode (instrumentation can hurt coverage).
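A rough sketch of how trigger labels might be fed into the generation prompt is shown below. Only the five trigger types named above are included, since the remaining four are not listed in this document. The enum, the function name, and the prompt wording are all hypothetical.

```python
from enum import Enum
from typing import Sequence

class Trigger(Enum):
    CONVERGENCE = "Convergence"
    BRIDGE = "Bridge"
    CONTRADICTION = "Contradiction"
    GAP = "Gap"
    TREND_CONFIRMED = "Trend_Confirmed"
    # The remaining four trigger types are not listed in this document.

def build_generation_prompt(state_summary: str, triggers: Sequence[Trigger]) -> str:
    """Prepend the detected trigger labels to the generation prompt (illustrative)."""
    labels = ", ".join(t.value for t in triggers) or "none"
    return (
        f"Detected changes in this window: {labels}\n\n"
        f"Current state:\n{state_summary}\n\n"
        "Propose hypotheses."
    )
```

Conditioning generation on a small label set is one plausible route to the narrowing effect described for CKM-Full, though this sketch does not establish that mechanism.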

## Submitting a new system

Open a pull request following [`SUBMISSION_FORMAT.md`](SUBMISSION_FORMAT.md). Each accepted submission updates this leaderboard with re-judged numbers.

## Versioning

- **v0.1** (current): 50 topics, frozen arXiv ID sets per phase per topic (`data/arxiv_ids/`), consolidated validation pool of 4,474 unique IDs (`data/validation_pool.json`), GPT-4o-family canonical judge, Gemini-2.5-Flash reference generator.
- **v0.2** (planned): alternative-judge mode for OpenAI-generated submissions (Claude or Gemini judge with calibration table); topic count expanded to 100.
- **v1.0** (long-term): cross-domain extension (math, biomedicine, theory) with calibrated judges; frozen content snapshot keyed by arXiv ID.

Numbers from one version are not directly comparable to numbers from another; submissions should specify the benchmark version they ran against.
