# Dataset Card

The supplement contains sanitized outcome rows, not raw benchmark problems. Item identifiers are stable within each run. Rows record run id, benchmark label, carrier, condition, turn, correctness, attempted-answer flag, operation parse status, number of accepted operations, number of reducer rejections, and inclusion status.
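As a rough illustration of the row layout described above, the fields could be modeled as a small typed record. This is only a sketch: apart from `included`, which appears elsewhere in this card, every field name and type below is an assumption and may not match the actual column headers in the supplement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutcomeRow:
    # All names except `included` are hypothetical; check the CSV header.
    run_id: str
    item_id: str
    benchmark: str
    carrier: str
    condition: str
    turn: str
    correct: bool
    attempted: bool
    parse_status: str
    accepted_ops: int
    reducer_rejections: int
    included: bool


def parse_bool(text: str) -> bool:
    # Assumes booleans are serialized as "True"/"False" strings.
    return text.strip().lower() == "true"


def row_from_record(record: dict) -> OutcomeRow:
    # Convert one CSV record (all string values) into a typed row.
    return OutcomeRow(
        run_id=record["run_id"],
        item_id=record["item_id"],
        benchmark=record["benchmark"],
        carrier=record["carrier"],
        condition=record["condition"],
        turn=record["turn"],
        correct=parse_bool(record["correct"]),
        attempted=parse_bool(record["attempted"]),
        parse_status=record["parse_status"],
        accepted_ops=int(record["accepted_ops"]),
        reducer_rejections=int(record["reducer_rejections"]),
        included=parse_bool(record["included"]),
    )
```

A typed record like this makes it harder to mix up the correctness flag with the attempted-answer flag when aggregating.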

The source benchmarks cited in the paper are MATH/MATH-500 Level 5, AIME 2024-25, and OlympiadBench.

OlympiadBench C3 effective denominator: 29 items. The registered run began from 30 submitted items; one item was excluded for an execution/error reason and is recorded both in `data/exclusions.csv` and as an `included=False` row in `data/headline_turns.csv`. The headline row-level T2 lift uses only included rows.
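The denominator rule above can be sketched as a filter on the inclusion flag followed by a count of distinct items. The snippet uses a tiny made-up sample rather than the real file, and the column names `item_id`, `turn`, and `correct` are assumptions; only `included` is taken from this card.

```python
import csv
import io

# Toy rows for illustration only; these are NOT values from the supplement.
SAMPLE = """item_id,turn,correct,included
a1,T0,False,True
a1,T2,True,True
a2,T0,False,False
a2,T2,False,False
a3,T0,True,True
a3,T2,True,True
"""


def load_rows(text):
    # Parse CSV text into a list of dicts with string values.
    return list(csv.DictReader(io.StringIO(text)))


def submitted_count(rows):
    # Number of distinct items, regardless of inclusion status.
    return len({r["item_id"] for r in rows})


def effective_denominator(rows):
    # Number of distinct items that have at least one included row.
    return len({r["item_id"] for r in rows if r["included"] == "True"})


rows = load_rows(SAMPLE)
print(submitted_count(rows), effective_denominator(rows))
```

Run against `data/headline_turns.csv` restricted to the relevant benchmark rows, the same two counts would be expected to correspond to the 30 submitted and 29 included items stated above, provided the column names are adjusted to the real header.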

The uploadable row file is restricted to the registered T0-to-T2 endpoint. T3 diagnostics from source runs are not included in the headline reproduction table.
