# Benchmark Data

This directory holds the data artifacts that define the CKM Benchmark v0.1.

## `topics.csv`

The 50 benchmark topics, one row per topic, with three columns:

- `slug`: machine-friendly identifier (matches JSON keys in `results/*_summary.json`)
- `category`: human-readable category name used in the paper's eight-category groupings
- `name`: full topic name used by the runner (matches `arxiv_search` queries)

Topics span eight categories (multilingual & speech, LLM core capabilities, LLM applications, safety & trust, multimodal, domain-specific AI, efficiency & data, other foundations) with deliberately heterogeneous evolution rates.
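As a minimal sketch of reading this file, the snippet below loads the rows and counts topics per category. It assumes `topics.csv` has a header row naming the three columns exactly `slug`, `category`, and `name`; the helper names `load_topics` and `topics_per_category` are hypothetical and not part of the benchmark code.

```python
import csv
from collections import Counter
from pathlib import Path


def load_topics(csv_path):
    """Read topics.csv into a list of dicts, one per topic row.

    Assumes a header row with the columns slug, category, name.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def topics_per_category(topics):
    """Count how many topics fall into each category."""
    return Counter(row["category"] for row in topics)
```

Calling `topics_per_category(load_topics("data/topics.csv"))` should then yield eight keys, one per category, if the file matches the description above.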

Selection criteria (verified against arXiv 2019-2027 at the time of v0.1 release):

- $\geq 30$ papers in $[2019, 2024]$ for initialization
- $\geq 50$ papers in $[2024, 2025]$ for evolution
- $\geq 30$ papers in $[2025, 2027]$ for validation
- $\geq 100$ papers total per topic
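The four thresholds can be expressed as a single predicate. This is only an illustrative sketch: the `TopicCounts` container and `meets_selection_criteria` function are hypothetical names, and the per-range counts are assumed to have been computed elsewhere. The total is kept as its own field rather than derived as a sum, since the README does not say how boundary years are assigned to ranges.

```python
from dataclasses import dataclass

# Thresholds taken from the selection criteria listed above.
MIN_INIT = 30
MIN_EVOLUTION = 50
MIN_VALIDATION = 30
MIN_TOTAL = 100


@dataclass(frozen=True)
class TopicCounts:
    init: int        # papers in the initialization range
    evolution: int   # papers in the evolution range
    validation: int  # papers in the validation range
    total: int       # all papers found for the topic


def meets_selection_criteria(counts):
    """Return True only if every threshold is satisfied."""
    return (
        counts.init >= MIN_INIT
        and counts.evolution >= MIN_EVOLUTION
        and counts.validation >= MIN_VALIDATION
        and counts.total >= MIN_TOTAL
    )
```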

## `arxiv_ids/`

Per-topic frozen arXiv ID lists, extracted from the canonical CKM-Lite reference run (`lite_20260405_000653`). There is one JSON file per topic, 50 files in total:

```json
// arxiv_ids/long_context_understanding_in_large_language_models.json
{
  "slug": "long_context_understanding_in_large_language_models",
  "init": ["2106.09685", "2107.02192", ...],          // up to 48 papers from 2019-2024
  "evolution": {
    "2024-01": ["2401.00091", ...],                    // up to 96 papers per window
    "2024-03": [...],
    "2024-05": [...],
    "2024-07": [...],
    "2024-09": [...],
    "2024-11": [...]
  },
  "validation": ["2501.04321", "2502.18293", ...],     // future ground truth, 2025-2027
  "_provenance": {
    "init_count": 38,
    "evolution_window_count": 6,
    "evolution_total_papers": 47,
    "validation_count": 86
  }
}
```

Aggregate across 50 topics: **1,864 init papers, 2,166 evolution papers, 4,866 validation papers** (4,474 unique arXiv IDs in the validation pool).

Because these ID lists are frozen, the benchmark inputs are deterministic: any system can submit results conditioned on these exact ID sets without re-running arXiv search, which would risk variance from arXiv's incremental indexing.
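Before conditioning a run on a topic file, it may be worth checking that the lists agree with the recorded `_provenance` block. The sketch below rests on two assumptions: that the files on disk are plain JSON (the `//` comments and `...` in the example above are illustrative only), and that each `_provenance` count equals the length of the corresponding list. All function names are hypothetical.

```python
import json
from pathlib import Path


def load_topic_ids(path):
    """Load one per-topic ID file (assumed to be plain JSON)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def count_ids(record):
    """Recompute the four provenance counts from the ID lists themselves."""
    evolution = record["evolution"]
    return {
        "init_count": len(record["init"]),
        "evolution_window_count": len(evolution),
        "evolution_total_papers": sum(len(ids) for ids in evolution.values()),
        "validation_count": len(record["validation"]),
    }


def provenance_mismatches(record):
    """Return {field: (recorded, recomputed)} for every count that disagrees."""
    recorded = record.get("_provenance", {})
    return {
        key: (recorded.get(key), value)
        for key, value in count_ids(record).items()
        if recorded.get(key) != value
    }
```

An empty dict from `provenance_mismatches` means the file is internally consistent under these assumptions; a non-empty one points at the field to inspect.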

## `validation_pool.json`

A consolidated index of every validation arXiv ID across all 50 topics, with per-topic attribution and per-paper "appears-in-which-topics" information. Generated from `arxiv_ids/` via:

```bash
python -m ckm_benchmark.build_validation_pool \
    --arxiv-ids-dir data/arxiv_ids \
    --output data/validation_pool.json
```

Submitters need only this file to see which papers their system will be judged against.
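To illustrate the "appears-in-which-topics" idea, the sketch below inverts the per-topic `validation` lists into a map from arXiv ID to topic slugs. It is not the implementation of `ckm_benchmark.build_validation_pool`, and its output is not claimed to match the schema of `validation_pool.json`; the function names are hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_records(arxiv_ids_dir):
    """Load every per-topic JSON file in the directory, in filename order."""
    paths = sorted(Path(arxiv_ids_dir).glob("*.json"))
    return [json.loads(path.read_text(encoding="utf-8")) for path in paths]


def invert_validation_ids(records):
    """Map each validation arXiv ID to the sorted slugs of topics listing it."""
    topics_by_paper = defaultdict(set)
    for record in records:
        for arxiv_id in record["validation"]:
            topics_by_paper[arxiv_id].add(record["slug"])
    return {
        arxiv_id: sorted(slugs)
        for arxiv_id, slugs in sorted(topics_by_paper.items())
    }
```

A paper listed under several topics appears once as a key here, which is why the count of unique validation IDs can be smaller than the sum of the per-topic validation counts.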

## Fetching paper content

The benchmark ships only arXiv IDs, not paper content, which avoids licensing complications and keeps the repo small. Submitters fetch each paper themselves: `https://arxiv.org/abs/<arxiv_id>` is the abstract page, the full-text PDF is at `https://arxiv.org/pdf/<arxiv_id>`, and the arXiv API returns metadata and abstracts. A frozen content snapshot keyed by arXiv ID is planned for v1.0.
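A small sketch of building the relevant URLs for a batch of IDs follows. The helper names are hypothetical. The PDF pattern and the `export.arxiv.org/api/query` endpoint with its `id_list` and `max_results` parameters come from arXiv's public documentation rather than from this benchmark. The code performs no network requests; callers are expected to add their own fetching and to respect arXiv's rate limits.

```python
from urllib.parse import urlencode

# Public arXiv API endpoint (not defined by this benchmark).
ARXIV_API_ENDPOINT = "http://export.arxiv.org/api/query"


def abs_url(arxiv_id):
    """Abstract page for one paper."""
    return f"https://arxiv.org/abs/{arxiv_id}"


def pdf_url(arxiv_id):
    """Full-text PDF for one paper."""
    return f"https://arxiv.org/pdf/{arxiv_id}"


def api_query_url(arxiv_ids):
    """Metadata query for a batch of IDs via the arXiv API.

    max_results is set explicitly because the API otherwise returns
    only a default-sized page of entries.
    """
    params = {"id_list": ",".join(arxiv_ids), "max_results": len(arxiv_ids)}
    # Keep commas unescaped so the ID list stays readable in the URL.
    return f"{ARXIV_API_ENDPOINT}?{urlencode(params, safe=',')}"
```

For example, `api_query_url(["2106.09685", "2107.02192"])` produces a single request URL covering both papers, which is gentler on the service than one request per ID.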
