# Problem Statement

## Task

Given a research **topic** $q$ and a **time horizon** $[t_\text{init}, t_T]$, an LLM-driven scientific-discovery system must produce a ranked set of **research hypotheses** $\mathcal{H}$ such that each hypothesis is:

1. **Anchored** in the literature available at generation time (no leakage from the future).
2. **Structured** enough to permit downstream verification (claim, research plan, source citations, self-assessment).
3. **Predictive** of where the field is heading: a hypothesis "hits" if at least one paper published **after** $t_T$ pursues a substantively similar direction.
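
The four structural elements named in item 2 can be carried in a small record type. The sketch below is illustrative only: the class name `Hypothesis`, its field names, and the 0 to 1 range for the self-assessment are assumptions, not a schema defined by this benchmark.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical container for one generated hypothesis (names are assumptions)."""

    claim: str                    # the statement being predicted
    research_plan: str            # how the claim could be pursued or tested
    source_ids: Tuple[str, ...]   # citations to papers visible at generation time
    self_assessment: float        # generator's own confidence, assumed to lie in [0, 1]
    generated_on: date            # needed later to compute temporal lead

    def is_structured(self) -> bool:
        """True when claim, plan, and at least one citation are all present."""
        return bool(
            self.claim.strip()
            and self.research_plan.strip()
            and self.source_ids
        )
```

A record with an empty claim or no citations fails `is_structured()`, which is one simple way to reject outputs that cannot be verified downstream.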

## Temporal protocol (3 phases)

For each of the 50 benchmark topics:

| Phase | Period | Role | Paper budget |
|---|---|---|---|
| **Initialization** | 2019-2024 | Build the baseline knowledge state $\mathcal{K}_0$ | $\leq$48 papers |
| **Evolution** | 2024-2025 | Six 2-month sliding windows; each window adds papers and emits hypotheses | $\leq$96 papers/window |
| **Validation** | 2025-2027 | Future ground truth; each generated hypothesis is graded against this set | up to 180 papers used as candidates |

A submitted system must:
- See **only** initialization-phase papers when building $\mathcal{K}_0$.
- See **only** papers in window $t$ (not future windows) when generating hypotheses for window $t$.
- **Never** see validation-phase papers; the evaluation harness handles the validation phase separately.
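
The visibility rules above amount to a date filter plus a leakage check. The following is a minimal sketch, assuming each paper carries a publication date. The `Paper` type, both function names, and the half-open window convention are hypothetical, and the exact window boundaries are not specified in this document.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class Paper:
    """Hypothetical paper record; only the publication date matters here."""

    paper_id: str
    published: date


def papers_in_window(papers: Iterable[Paper], start: date, end: date) -> List[Paper]:
    """Keep papers published in the half-open interval [start, end)."""
    return [p for p in papers if start <= p.published < end]


def check_no_leakage(papers: Iterable[Paper], cutoff: date) -> None:
    """Raise if any paper handed to the generator is dated on or after cutoff."""
    leaked = [p.paper_id for p in papers if p.published >= cutoff]
    if leaked:
        raise ValueError(f"future papers leaked into generation: {leaked}")
```

Calling `check_no_leakage` with the end of the current window as `cutoff` just before generation gives a cheap guard against accidentally passing later-window or validation-phase papers to the system.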

## What counts as a "hit"

A hypothesis is judged a **hit** if at least one validation paper aligns with it at score $\geq 6.0$ on a 1-10 LLM-judge scale, where the judge is:
1. **Two-stage**: GPT-4o-mini pre-filter (threshold 5.0) → GPT-4o re-judgment (threshold 6.0).
2. **Cross-provider**: judges use OpenAI models; reference systems use Gemini-2.5-Flash for generation, so weights are never shared between generator and judge.
3. **Embedding-pre-filtered**: candidates from the validation pool are surfaced by embedding similarity against the hypothesis text, then judged in full.

Full judge protocol in [`EVALUATION.md`](EVALUATION.md).
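
The two-stage decision can be sketched as follows, with the judges passed in as plain callables so that no model API is assumed. The thresholds 5.0 and 6.0 come from the list above. The function name `best_match`, the `Judge` signature, and the choice to treat a stage-one score of exactly 5.0 as passing are assumptions. Candidates are assumed to have already been surfaced by the embedding pre-filter.

```python
from typing import Callable, Iterable, Optional, Tuple

PREFILTER_THRESHOLD = 5.0  # stage 1 (cheap judge), from the protocol above
HIT_THRESHOLD = 6.0        # stage 2 (strong judge), from the protocol above

# Hypothetical judge signature: (hypothesis_text, paper_text) -> score on a 1-10 scale.
Judge = Callable[[str, str], float]


def best_match(
    hypothesis: str,
    candidates: Iterable[str],
    cheap_judge: Judge,
    strong_judge: Judge,
) -> Tuple[bool, Optional[float]]:
    """Return (is_hit, best stage-2 score); the score is None if nothing passed stage 1."""
    best: Optional[float] = None
    for paper in candidates:
        # Stage 1: skip candidates the cheap judge scores below the pre-filter threshold.
        if cheap_judge(hypothesis, paper) < PREFILTER_THRESHOLD:
            continue
        # Stage 2: only surviving candidates are re-judged by the strong judge.
        score = strong_judge(hypothesis, paper)
        if best is None or score > best:
            best = score
    return (best is not None and best >= HIT_THRESHOLD, best)
```

Keeping the best stage-two score even when it falls below 6.0 allows a continuous best-match score to be reported alongside the binary hit decision.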

## Metrics

For a system run across the 50 topics:

- **Hit rate** ($\%$): percentage of generated hypotheses that hit.
- **Coverage** ($n/50$): topics where at least one hypothesis hits.
- **Yield**: mean hypotheses generated per topic.
- **Best-match score**: highest alignment score any validation paper receives for a hypothesis, reported as a continuous value even when it falls below the hit threshold.
- **Temporal lead** (days): for hits, time between hypothesis generation and matched-paper publication.
- **Token cost** (tokens): mean tokens consumed by generation per topic.
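
Several of these metrics can be computed from per-hypothesis judgments. The sketch below is a minimal illustration under stated assumptions: the `Judged` record and `summarize` function are hypothetical names, hit rate is pooled over all hypotheses rather than averaged per topic, yield divides by the full topic count, and temporal lead is aggregated as a mean over hits. None of these aggregation choices is specified in this document.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Judged:
    """Hypothetical per-hypothesis judgment record."""

    topic: str
    generated_on: date
    hit: bool
    matched_published: Optional[date] = None  # publication date of the matched paper, hits only


def summarize(results: List[Judged], n_topics: int) -> Dict[str, float]:
    """Aggregate hit rate, coverage, yield, and mean temporal lead."""
    hits = [r for r in results if r.hit]
    # Temporal lead in days, computed only for hits with a known match date.
    leads = [
        (r.matched_published - r.generated_on).days
        for r in hits
        if r.matched_published is not None
    ]
    return {
        "hit_rate_pct": 100.0 * len(hits) / len(results) if results else 0.0,
        "coverage": len({r.topic for r in hits}),   # topics with at least one hit
        "yield": len(results) / n_topics,           # mean hypotheses per topic
        "mean_lead_days": sum(leads) / len(leads) if leads else float("nan"),
    }
```

Coverage counts distinct topics among the hits, so a topic with many hits contributes the same as a topic with one.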

## Why predictive validation

Contemporary judges (human review, agent peer review, wet-lab assays, simulated tournaments, self-evaluation) measure whether an output **looks good** to the judge today. They do not test whether the output **anticipates the field**.

Predictive validation against future literature has three properties contemporary judges lack:
1. **Falsifiability**: the answer is determined by what the field actually does, not by current opinion.
2. **No self-preference bias**: the ground truth (future arXiv papers) is independent of any LLM; the LLM judge only measures alignment with it.
3. **Long-horizon signal**: validation lead times in our reference run range from ~120 to ~680 days, exposing systems' ability to anticipate over a multi-quarter horizon.

The trade-off: predictive validation **rewards alignment with where the field went**, which can penalize genuinely novel directions the field has not yet visited. Hits should be read as a lower-bound measure of usefulness, not an upper bound on novelty.

## Out of scope

- **Single-domain validation** (e.g., wet-lab biology only): the benchmark currently tests on ML topics with a workflow meant to generalize; future versions will extend to math, theory, and biomedicine.
- **Human user studies**: orthogonal to predictive validation; both are useful and complementary.
- **End-to-end paper generation**: this benchmark targets hypothesis-level prediction, not full-paper writing.
