# CKM Benchmark — Predictive Hypothesis Generation from Evolving Literature

A benchmark for evaluating LLM-driven scientific-discovery systems against **literature published after the generation window**. Unlike contemporary-judge protocols (human review, agent peer review, wet-lab assays, self-evaluation), the CKM Benchmark grades each generated hypothesis by whether subsequent arXiv papers actually pursued the predicted direction.

## Why this benchmark

Most LLM-driven discovery systems today produce hypotheses, ideas, or full papers and validate them against contemporary judges. None of these protocols tests whether a system anticipates where a field is heading. The CKM Benchmark fills that gap with three properties:

1. **Predictive evaluation.** Each generated hypothesis is graded against papers published *after* the generation window (validation period: 2025-2027), not against contemporary judgments.
2. **50 ML topics, 8 categories.** The topics span both stable subfields (e.g., multilingual NLP) and fast-moving ones (e.g., LLM applications), so the benchmark exposes both hits and failure modes.
3. **Reproducible at academic budget.** A reference run on all 50 topics completes in ~44 hours of wall-clock time and costs ~\$130 in API fees, driven from a laptop that only issues API calls.
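
Property 1 amounts to a date filter over candidate papers. The sketch below illustrates that filter under stated assumptions: the inclusive bounds (2025-01-01 through 2027-12-31), the `(arxiv_id, published)` record shape, and the placeholder IDs are all hypothetical. The released benchmark ships frozen ID lists under `data/arxiv_ids/` rather than filtering at run time.

```python
from datetime import date

# Assumed inclusive bounds for the 2025-2027 validation period.
VALIDATION_START = date(2025, 1, 1)
VALIDATION_END = date(2027, 12, 31)


def in_validation_window(published: date) -> bool:
    """True if a paper's publication date falls inside the validation period."""
    return VALIDATION_START <= published <= VALIDATION_END


def split_by_window(papers):
    """Split (arxiv_id, published) pairs into (inside window, outside window)."""
    inside = [p for p in papers if in_validation_window(p[1])]
    outside = [p for p in papers if not in_validation_window(p[1])]
    return inside, outside


# Placeholder records, not real arXiv IDs.
papers = [
    ("id-a", date(2024, 11, 3)),
    ("id-b", date(2025, 6, 15)),
    ("id-c", date(2027, 12, 31)),
]
future, other = split_by_window(papers)
print([arxiv_id for arxiv_id, _ in future])
```

Only papers in `future` would be eligible as ground truth for grading; anything published earlier stays out of the validation pool.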

## Quick start

```bash
# 1. Install the benchmark's dependencies
pip install -r requirements.txt

# 2. Configure API access (only needed for re-judging; reading results has no API cost)
cp .env.example .env
$EDITOR .env  # set OPENAI_API_KEY and any optional overrides

# 3. Inspect the 50 topics
cat data/topics.csv

# 4. Reproduce the leaderboard from released summary JSONs (no API calls)
python -m ckm_benchmark.recompute --summary \
    results/lite_summary.json \
    results/batch_summary.json \
    results/full_summary.json
```
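
As an alternative to `cat` in step 3, the topics file can be summarized from Python. This is a minimal sketch that assumes only that `data/topics.csv` has a header row; it does not assume any column names, and `summarize_csv` is a hypothetical helper, not part of `ckm_benchmark`.

```python
import csv
from collections import Counter
from pathlib import Path


def summarize_csv(path, group_by=None):
    """Return (column names, row count, per-value counts for `group_by`)."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        columns = list(reader.fieldnames or [])
    counts = Counter(row[group_by] for row in rows) if group_by else Counter()
    return columns, len(rows), counts


if __name__ == "__main__":
    topics_path = Path("data/topics.csv")
    if topics_path.exists():
        columns, n_rows, _ = summarize_csv(topics_path)
        print("columns:", columns)
        print("rows:", n_rows)
```

Once the printed column list shows which column holds the category, passing its name as `group_by` gives the number of topics per category.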

All configuration knobs (judge models, endpoint, timeouts, top-K) are documented in [`.env.example`](.env.example).

The full reference implementations (CKM-Lite, CKM-Full, and the Batch baseline) live in [`reference_systems/`](reference_systems/) and share their code path with the [`ckm-eval/`](../ckm-eval/) research harness.

## Directory layout

```
benchmark/
├── README.md                       # this file
├── LICENSE                         # MIT
├── data/
│   ├── topics.csv                  # 50 topics × 8 categories
│   ├── arxiv_ids/                  # frozen arXiv IDs per phase per topic (50 JSON files)
│   └── validation_pool.json        # consolidated future-ground-truth index (4,474 unique IDs)
├── docs/
│   ├── PROBLEM_STATEMENT.md        # formal task definition
│   ├── SUBMISSION_FORMAT.md        # how to submit your system's hypotheses
│   ├── EVALUATION.md               # judge protocol + safeguards
│   └── LEADERBOARD.md              # current standings
├── results/
│   ├── lite_summary.json           # CKM-Lite reference results (50 topics)
│   ├── full_summary.json           # CKM-Full instrumented variant
│   └── batch_summary.json          # one-shot Batch baseline
├── reference_systems/              # CKM-Lite / CKM-Full / Batch entry points
│   └── README.md
└── examples/
    └── example_hypothesis.json     # one valid hypothesis in submission format
```
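
To confirm that a checkout matches the tree above, a small script can test each path. This is a sketch only: `check_layout` is a hypothetical helper, and the list of paths is copied from the tree rather than from any manifest in the repository.

```python
from pathlib import Path

# Paths copied from the directory tree above, relative to benchmark/.
EXPECTED_PATHS = [
    "README.md",
    "LICENSE",
    "data/topics.csv",
    "data/arxiv_ids",
    "data/validation_pool.json",
    "docs/PROBLEM_STATEMENT.md",
    "docs/SUBMISSION_FORMAT.md",
    "docs/EVALUATION.md",
    "docs/LEADERBOARD.md",
    "results/lite_summary.json",
    "results/full_summary.json",
    "results/batch_summary.json",
    "reference_systems/README.md",
    "examples/example_hypothesis.json",
]
EXPECTED_ARXIV_ID_FILES = 50  # "50 JSON files" per the tree above


def check_layout(root):
    """Return a list of problems; an empty list means the layout matches."""
    root = Path(root)
    problems = [f"missing: {rel}" for rel in EXPECTED_PATHS if not (root / rel).exists()]
    ids_dir = root / "data" / "arxiv_ids"
    n_json = len(list(ids_dir.glob("*.json"))) if ids_dir.is_dir() else 0
    if n_json != EXPECTED_ARXIV_ID_FILES:
        problems.append(
            f"data/arxiv_ids has {n_json} JSON files, expected {EXPECTED_ARXIV_ID_FILES}"
        )
    return problems
```

Calling `check_layout(".")` from inside `benchmark/` returns an empty list when every listed file is present and `data/arxiv_ids/` holds exactly 50 JSON files.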

## Current leaderboard (v0.1, 50 topics)

| System | Yield/topic | Hit% | Coverage | Tokens/topic | Wallclock/topic |
|---|---|---|---|---|---|
| **CKM-Lite** (this work) | 17.3 | **5.8%** | **36/50** | **0.51M** | **53 min** |
| Batch (one-shot) | 13.7 | 3.0% | 15/50 | 5.93M | 55 min |
| CKM-Full (instrumented) | **17.8** | 1.4% | 11/50 | 5.87M | 67 min |

See [`docs/LEADERBOARD.md`](docs/LEADERBOARD.md) for full numbers and methodology.
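
The per-topic columns can be scaled to the full 50-topic run with simple arithmetic. The sketch below hard-codes the table values and derives totals; the dictionary keys are hypothetical names, and "Coverage" is treated only as a fraction of the 50 topics without assuming how a covered topic is defined.

```python
N_TOPICS = 50

# Values transcribed from the v0.1 leaderboard table above.
LEADERBOARD = {
    "CKM-Lite": {"covered": 36, "tokens_m": 0.51, "wallclock_min": 53},
    "Batch": {"covered": 15, "tokens_m": 5.93, "wallclock_min": 55},
    "CKM-Full": {"covered": 11, "tokens_m": 5.87, "wallclock_min": 67},
}


def derived_totals(row):
    """Scale per-topic figures to the full 50-topic run."""
    return {
        "coverage_fraction": row["covered"] / N_TOPICS,
        "total_hours": row["wallclock_min"] * N_TOPICS / 60,
        "total_tokens_millions": row["tokens_m"] * N_TOPICS,
    }


for name, row in LEADERBOARD.items():
    totals = derived_totals(row)
    print(
        f"{name}: coverage {totals['coverage_fraction']:.0%}, "
        f"{totals['total_hours']:.1f} h, "
        f"{totals['total_tokens_millions']:.1f}M tokens"
    )

# Token cost of the Batch baseline relative to CKM-Lite.
batch_to_lite = LEADERBOARD["Batch"]["tokens_m"] / LEADERBOARD["CKM-Lite"]["tokens_m"]
print(f"Batch uses about {batch_to_lite:.1f}x the tokens of CKM-Lite")
```

For CKM-Lite, 53 minutes per topic over 50 topics comes to roughly 44 hours, which matches the reference-run figure quoted earlier if that figure refers to CKM-Lite (an assumption; the README does not say which system it describes).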

## Submitting a system

Run your system on the 50 topics, emit hypotheses in the schema documented in [`docs/SUBMISSION_FORMAT.md`](docs/SUBMISSION_FORMAT.md), and open a pull request adding your `<your-system>_summary.json` under `results/` plus a one-paragraph entry in `docs/LEADERBOARD.md`. Submissions are re-judged in a single batch using the canonical judge to keep numbers comparable across systems.
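
Before opening the pull request, it may help to sanity-check the summary file locally. The sketch below checks only what this README states (location under `results/`, the `<your-system>_summary.json` naming pattern) plus JSON well-formedness. `check_summary_file` and the set of characters allowed in a system name are assumptions, and the function does not validate the schema in `docs/SUBMISSION_FORMAT.md`.

```python
import json
import re
from pathlib import Path

# Assumed character set for <your-system>; the README does not restrict it.
SUMMARY_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*_summary\.json$")


def check_summary_file(path):
    """Return a list of problems; an empty list means the basic checks passed."""
    path = Path(path)
    problems = []
    if path.parent.name != "results":
        problems.append("file is not directly under results/")
    if not SUMMARY_NAME.match(path.name):
        problems.append("file name does not match <your-system>_summary.json")
    try:
        json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        problems.append("file does not exist")
    except json.JSONDecodeError as exc:
        problems.append(f"not valid JSON: {exc.msg}")
    return problems
```

For example, `check_summary_file("results/my-system_summary.json")` returns an empty list when the file exists, is named as expected, and parses as JSON.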

## Citation

```bibtex
@inproceedings{ckm2026,
  title  = {Continuous Knowledge Metabolism: Generating Scientific Hypotheses from Evolving Literature},
  author = {[Anonymous]},
  booktitle = {AI4Research Workshop at ICML 2026},
  year   = {2026}
}
```

## License

MIT. See [LICENSE](LICENSE).
