# Reference Systems

Three reference systems are released alongside the v0.1 benchmark. All three live in the same codebase under [`../../ckm-eval/`](../../ckm-eval/) and are listed in [`docs/LEADERBOARD.md`](../docs/LEADERBOARD.md).

## CKM-Lite (the recommended starting point)

CKM-Lite is a three-stage incremental workflow:

1. **Initialization** ($\leq$48 papers, 2019-2024): build the baseline knowledge state $\mathcal{K}_0$ as Markdown topic files.
2. **Window update** (six 2-month sliding windows over 2024-2025): fold in-window papers into $\mathcal{K}_t$ via append-and-tag operations.
3. **Hypothesis generation**: condition on $\mathcal{K}_t$, the running summary of $\mathcal{K}_{t-1} \to \mathcal{K}_t$ changes, and the cumulative prior hypotheses.
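
The three stages above can be read as a simple loop over windows. The following is a minimal sketch of that control flow only, not the code in `ckm-eval`: `KnowledgeState`, `initialize`, `update`, `generate_hypotheses`, and `run` are hypothetical names, papers are assumed to be plain dicts with `topic` and `title` keys, and the hypothesis step is a string placeholder where a real system would call an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeState:
    """Hypothetical stand-in for K_t: topic name -> Markdown text."""
    topics: dict[str, str] = field(default_factory=dict)


def initialize(papers):
    """Stage 1: build K_0 as one Markdown entry per topic."""
    state = KnowledgeState()
    for paper in papers:
        state.topics.setdefault(paper["topic"], f"# {paper['topic']}\n")
        state.topics[paper["topic"]] += f"- {paper['title']}\n"
    return state


def update(state, window_id, papers):
    """Stage 2: append-and-tag. Returns K_t and a summary of what changed."""
    new_topics = dict(state.topics)
    changes = []
    for paper in papers:
        body = new_topics.get(paper["topic"], f"# {paper['topic']}\n")
        new_topics[paper["topic"]] = body + f"- {paper['title']} [window:{window_id}]\n"
        changes.append(f"{paper['topic']}: +{paper['title']}")
    return KnowledgeState(new_topics), changes


def generate_hypotheses(state, changes, prior):
    """Stage 3 placeholder: conditions on K_t, the changes, and prior hypotheses."""
    return [f"H{len(prior) + i + 1}: follow up on {c}" for i, c in enumerate(changes)]


def run(init_papers, windows):
    """Run initialization once, then one update + generation step per window."""
    state = initialize(init_papers)
    hypotheses = []
    for window_id, papers in enumerate(windows, start=1):
        state, changes = update(state, window_id, papers)
        hypotheses += generate_hypotheses(state, changes, hypotheses)
    return state, hypotheses
```

Each window sees only its own papers plus the accumulated state and the hypotheses so far. That is the property separating this workflow from the one-shot Batch baseline.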

- **Entry point**: `../../ckm-eval/scripts/lite/eval_single.py`
- **Batch runner**: `../../ckm-eval/scripts/lite/batch_run.py`
- **Per-run cost**: ~\$2-3 per topic, ~50 minutes wall-clock.

## Batch (one-shot baseline)

A single LLM call per topic that ingests all in-window papers at once.

- **Entry point**: `../../ckm-eval/scripts/pool/eval_single.py`
- **Batch runner**: `../../ckm-eval/scripts/pool/batch_run.py`
- **Per-run cost**: ~\$25 per topic (~11× CKM-Lite), ~55 minutes wall-clock.

Included as a lower-bound baseline: any reasonable system should beat it.

## CKM-Full (instrumented analytical variant)

CKM-Lite plus an explicit change-detection step that classifies each window's update into nine trigger types and feeds the resulting labels into the generation prompt.

- **Entry point**: `../../ckm-eval/scripts/eval_single.py`
- **Batch runner**: `../../ckm-eval/scripts/batch_run.py`
- **Per-run cost**: ~\$25 per topic, ~67 minutes wall-clock.

Use it as an analytical lens for studying the trade-off between hypothesis quality and coverage. It is not recommended for deployment: its coverage is 4× lower than CKM-Lite's (see F1 in the paper).
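
To illustrate how change-detection labels might reach the generation prompt, here is a rough sketch under stated assumptions. The nine trigger types are not listed in this document, so `TRIGGER_TYPES` holds placeholder names (`trigger_1` to `trigger_9`). `build_generation_prompt` and the prompt layout are hypothetical, not the format used in `ckm-eval`.

```python
from collections import Counter

# Placeholder names only: the real nine trigger types are defined elsewhere.
TRIGGER_TYPES = tuple(f"trigger_{i}" for i in range(1, 10))


def build_generation_prompt(knowledge_md, window_labels):
    """Append a per-type count of detected changes to the knowledge state."""
    unknown = set(window_labels) - set(TRIGGER_TYPES)
    if unknown:
        raise ValueError(f"unknown trigger types: {sorted(unknown)}")
    counts = Counter(window_labels)
    # List only the trigger types that actually fired in this window.
    label_lines = [f"- {name}: {counts[name]}" for name in TRIGGER_TYPES if counts[name]]
    return "\n".join(
        ["## Knowledge state", knowledge_md, "## Detected changes", *label_lines]
    )
```

Rejecting unknown labels up front keeps a mislabeled window from silently reaching the generator.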

## Reproducing the leaderboard run

```bash
cd ../../ckm-eval
pip install -r requirements.txt

# Set API keys
export OPENAI_API_KEY=sk-...
export GOOGLE_API_KEY=...

# Run CKM-Lite on all 50 topics (~44 hours, ~$130)
python scripts/lite/batch_run.py --concurrency 4

# Re-judge using the canonical judge to verify reproducibility
python -m ckm_benchmark.rejudge --summary results/lite_*/batch_summary.json
```
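
Before launching a full run, it can help to sanity-check the budget. The sketch below multiplies the per-topic figures quoted in the sections above by the topic count. `SYSTEMS` and `estimate` are illustrative names, and `topic_hours` is the sum of per-topic wall-clock time, so it ignores concurrency, retries, and rate limits.

```python
# Per-topic figures copied from the sections above (cost range in USD,
# wall-clock minutes). The default of 50 topics matches the batch-run comment.
SYSTEMS = {
    "CKM-Lite": {"usd": (2.0, 3.0), "minutes": 50},
    "Batch": {"usd": (25.0, 25.0), "minutes": 55},
    "CKM-Full": {"usd": (25.0, 25.0), "minutes": 67},
}


def estimate(system, n_topics=50):
    """Return a rough cost range and summed per-topic time for one system."""
    spec = SYSTEMS[system]
    low, high = spec["usd"]
    return {
        "usd_low": low * n_topics,
        "usd_high": high * n_topics,
        "topic_hours": spec["minutes"] * n_topics / 60,
    }


if __name__ == "__main__":
    for name in SYSTEMS:
        est = estimate(name)
        print(
            f"{name}: ${est['usd_low']:.0f}-{est['usd_high']:.0f}, "
            f"{est['topic_hours']:.1f} topic-hours"
        )
```

For CKM-Lite this gives a range of \$100-150, which brackets the ~\$130 quoted for the leaderboard run.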

## Adding your system as a fourth reference

If your system differs substantially from the three above (e.g., a different model family, a multi-agent design, or a non-incremental architecture) and you release it as open source, we welcome it as a community-contributed reference system. Open an issue describing the system, and the maintainers will help you wire it into the benchmark protocol.
