# Evaluation Protocol

The CKM Benchmark uses a **two-stage cross-provider LLM judge** to grade each submitted hypothesis against papers published after the generation window. This document specifies the protocol so that submissions can be reproduced and audited.

## Judge architecture

```
                ┌──────────────────────────────────────────────────────┐
                │  Validation pool: arXiv papers from 2025-2027        │
                └─────────────────────┬────────────────────────────────┘
                                      │
                                      v
              ┌────────────────────────────────────────┐
              │  Embedding pre-filter                  │
              │  (text-embedding-3-small, OpenAI)      │
              │  → top-K candidates per hypothesis     │
              └─────────────────┬──────────────────────┘
                                │
                                v
              ┌────────────────────────────────────────┐
              │  Stage 1: GPT-4o-mini judge            │
              │  threshold 5.0 (broad pre-filter)      │
              └─────────────────┬──────────────────────┘
                                │
                                v
              ┌────────────────────────────────────────┐
              │  Stage 2: GPT-4o judge                 │
              │  threshold 6.0 (strict re-judgment)    │
              │  → final hit/miss decision             │
              └────────────────────────────────────────┘
```

## Step-by-step

For each generated hypothesis $h$:

1. **Embedding pre-filter.** Compute the embedding of the hypothesis statement via `text-embedding-3-small`, compute its cosine similarity to every paper in the validation pool, and retain the top-$K$ candidates (default $K=30$). This stage is independent of the generation model and is recall-oriented: it selects which papers the judges see and never scores a hypothesis itself.

2. **Stage 1 (pre-filter judge).** Send each candidate (title + abstract + first 2000 characters of body) to GPT-4o-mini with the canonical judge prompt. Score the hypothesis-paper alignment on a 1-10 scale. Retain candidates scoring $\geq 5.0$.

3. **Stage 2 (re-judge).** For surviving candidates, re-run the same judging task with GPT-4o, also scoring 1-10. Hypothesis $h$ is a **hit** if any candidate scores $\geq 6.0$.

4. **Best-match score.** Even for hypotheses that do not hit, record the maximum stage-2 alignment score. This continuous metric supports near-miss analysis.
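
The four steps above can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: `top_k`, `judge_input`, `evaluate`, and the `stage1`/`stage2` callables are hypothetical names, and the embeddings are assumed to be precomputed. The two judges are passed in as plain functions returning a 1-10 score, so no API call is shown. Returning `None` as the best-match score when no candidate survives stage 1 is an assumption; the protocol text does not specify that case.

```python
import math
from typing import Callable, Dict, Optional, Sequence

K_DEFAULT = 30          # top-K retained by the embedding pre-filter (step 1)
STAGE1_THRESHOLD = 5.0  # stage-1 retention threshold (step 2)
STAGE2_THRESHOLD = 6.0  # stage-2 hit threshold (step 3)
BODY_CHARS = 2000       # body prefix length sent to the judge (step 2)

Judge = Callable[[str, str], float]  # (hypothesis, paper text) -> 1-10 score


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(hyp_vec: Sequence[float], paper_vecs: Dict[str, Sequence[float]],
          k: int = K_DEFAULT) -> list:
    """Ids of the k papers most similar to the hypothesis, best first."""
    ranked = sorted(paper_vecs,
                    key=lambda pid: cosine(hyp_vec, paper_vecs[pid]),
                    reverse=True)
    return ranked[:k]


def judge_input(paper: dict) -> str:
    """Title + abstract + first 2000 characters of body."""
    return "\n\n".join([paper["title"], paper["abstract"],
                        paper["body"][:BODY_CHARS]])


def evaluate(hypothesis: str, hyp_vec: Sequence[float], papers: Dict[str, dict],
             paper_vecs: Dict[str, Sequence[float]],
             stage1: Judge, stage2: Judge, k: int = K_DEFAULT) -> dict:
    """Run pre-filter, stage 1, and stage 2 for one hypothesis."""
    candidates = top_k(hyp_vec, paper_vecs, k)
    survivors = [pid for pid in candidates
                 if stage1(hypothesis, judge_input(papers[pid])) >= STAGE1_THRESHOLD]
    scores = {pid: stage2(hypothesis, judge_input(papers[pid])) for pid in survivors}
    # Assumption: no stage-2 scores means no best-match score (None).
    best: Optional[float] = max(scores.values()) if scores else None
    return {"hit": best is not None and best >= STAGE2_THRESHOLD,
            "best_match": best,
            "stage2_scores": scores}
```

Keeping the judges as injected callables means the thresholds and control flow can be checked with stub scorers before any tokens are spent.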

## Canonical judge prompt (excerpt)

The judge sees both the hypothesis and a candidate paper and scores their alignment on five sub-dimensions. The excerpt below shows only the overall 1-10 scale:

```
Given a research hypothesis and a candidate future paper, score the
alignment on a 1-10 scale, where:
  10 = the matched paper realizes essentially the same proposal
   8 = same problem, similar method, different specifics
   6 = same general direction (HIT THRESHOLD)
   4 = related but with substantively different approach
   2 = same broad area but unrelated specifics
   1 = unrelated

Provide a one-paragraph rationale citing concrete shared elements.
```

The full prompt template is in [`reference_systems/lite/judge_prompts.md`](../reference_systems/lite/judge_prompts.md).

## Validity safeguards

The judge protocol embeds five safeguards against optimistic measurement:

1. **Pre-window-only generation input.** Generation prompts in submitted systems must see only papers from before the current evolution window. The judge does not enforce this; it relies on submission self-certification, with random spot-checks.
2. **Post-cutoff validation.** Validation papers (2025-2027) post-date the training cutoff of GPT-4o (October 2023) and, apart from papers published in early 2025, that of Gemini-2.5-Flash (Q1 2025). This limits trivial memorization.
3. **Cross-provider isolation.** Generation models (Google Gemini family) and judge models (OpenAI GPT-4o family) come from different providers. Submissions using OpenAI models for generation must use a non-OpenAI judge variant; see "Alternative-judge mode" below.
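4. **Two-stage judging.** A weaker stage-1 judge with a permissive threshold (5.0) keeps borderline candidates in play; the stronger stage-2 judge re-scores them at a stricter threshold (6.0). This catches optimistic single-judge errors.
5. **Embedding pre-filter is model-agnostic.** The pre-filter does not depend on the generation model and makes no hit/miss decision. Because it can only drop candidates, it can lower the measured hit rate but never inflate it; a relevant paper ranked below the top $K$ simply never reaches the judge.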
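
Safeguards 2 and 3 lend themselves to a mechanical pre-submission check. The sketch below is illustrative only: `provider_of`, `cross_provider_ok`, `post_cutoff_ok`, and the prefix table are hypothetical and not part of the benchmark tooling. Cutoff dates are supplied by the caller; treating "October 2023" and "Q1 2025" as the last day of the period is an assumption.

```python
from datetime import date
from typing import Iterable

# Hypothetical prefix -> provider table; extend it for other model families.
PROVIDER_BY_PREFIX = {
    "gpt-": "openai",
    "text-embedding-": "openai",
    "gemini-": "google",
    "claude-": "anthropic",
}


def provider_of(model: str) -> str:
    """Map a model name to its provider via a case-insensitive prefix match."""
    name = model.lower()
    for prefix, provider in PROVIDER_BY_PREFIX.items():
        if name.startswith(prefix):
            return provider
    raise ValueError(f"unknown model family: {model!r}")


def cross_provider_ok(generation_models: Iterable[str],
                      judge_models: Iterable[str]) -> bool:
    """True if no provider appears on both the generation and the judge side."""
    gen = {provider_of(m) for m in generation_models}
    judge = {provider_of(m) for m in judge_models}
    return gen.isdisjoint(judge)


def post_cutoff_ok(paper_date: date, cutoffs: Iterable[date]) -> bool:
    """True if the paper is dated strictly after every listed training cutoff."""
    return all(paper_date > cutoff for cutoff in cutoffs)
```

For example, `cross_provider_ok(["Gemini-2.5-Flash"], ["GPT-4o-mini", "GPT-4o"])` holds for the canonical setup, while a GPT-based generator paired with the canonical judge fails the check.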

## Reproducing the canonical judge run

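The judge is intended to be reproducible given fixed seeds and pinned model versions, although hosted LLM APIs do not guarantee bit-identical outputs, so individual scores may vary slightly between runs. To re-run on a submitted summary: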

```bash
python -m ckm_benchmark.rejudge \
    --hypotheses path/to/your-system/hypotheses/ \
    --validation-pool data/validation_papers/ \
    --output results/your-system_summary.json
```

Expected runtime: ~2 hours per 50 topics on a single laptop with API access; cost ~\$10-15 in OpenAI tokens for re-judging.

## Alternative-judge mode (planned, v0.2)

For submissions using OpenAI models for generation (e.g., GPT-5-based systems), the canonical judge would create a self-preference risk. v0.2 will offer an alternative judge using Claude Opus or Gemini, with a calibration table mapping its scores to the canonical 1-10 scale.
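
One way such a calibration table could be applied is piecewise-linear interpolation between anchor pairs. The sketch below is only a guess at the shape of the v0.2 mechanism: `to_canonical` is a hypothetical name, and the anchor values are placeholders chosen for illustration, not measured calibration data.

```python
from bisect import bisect_right
from typing import List, Tuple

# Placeholder anchors (alternative-judge score, canonical score).
# These are NOT measured values; a real table would come from calibration runs.
CALIBRATION: List[Tuple[float, float]] = [(1.0, 1.0), (5.0, 4.5), (10.0, 10.0)]


def to_canonical(alt_score: float,
                 table: List[Tuple[float, float]] = CALIBRATION) -> float:
    """Map an alternative-judge score onto the canonical 1-10 scale.

    Interpolates linearly between neighbouring anchors and clamps scores
    outside the anchored range to the nearest endpoint.
    """
    xs = [x for x, _ in table]
    if alt_score <= xs[0]:
        return table[0][1]
    if alt_score >= xs[-1]:
        return table[-1][1]
    i = bisect_right(xs, alt_score)  # index of the first anchor above alt_score
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (alt_score - x0) / (x1 - x0)
```

The 6.0 hit threshold would then be applied to the mapped score, so a hit means the same thing under either judge.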

## Known judge limitations

- **Surface-form bias.** Judges sometimes weight terminology overlap over conceptual alignment. We mitigate this with two-stage judging and the embedding pre-filter.
- **Domain transfer.** The current judge prompt is calibrated on ML hypotheses; applying it to math or biomedicine may need re-calibration.
- **Score saturation.** A small fraction of hypotheses receive scores at the 1 or 10 endpoints; we report continuous best-match scores to capture mid-range alignment.

The judge prompt and scoring rubric are versioned in the repository (`v0.1` for the initial release). Future revisions will be tagged so that prior leaderboard entries remain reproducible against their original judge version.
