claim_id,claim,value,ci_or_count,source,safe_wording
L1,MATH-500 Level 5 original loop C2,0.1343,"[0.0746, 0.1940]",data/headline_turns.csv + scripts/reproduce_metrics.py,within-loop cumulative T2/pass-by-budget coverage
L2,MATH-500 Level 5 original loop C3,0.1716,"[0.1120, 0.2390]",data/headline_turns.csv + scripts/reproduce_metrics.py,within-loop cumulative T2/pass-by-budget coverage
L3,OlympiadBench original loop C2,0.0667,"[0.0, 0.1670]",data/headline_turns.csv + scripts/reproduce_metrics.py,bounded to this registered harness/source run
L4,OlympiadBench original loop C3,0.2414,"[0.1030, 0.4140]",data/headline_turns.csv + data/exclusions.csv,bounded to effective n=29 after one execution/error exclusion
L5,AIME 2024-25 original loop C2,0.1000,"[0.0, 0.2000]",data/headline_turns.csv + scripts/reproduce_metrics.py,boundary result with confidence interval touching zero
L6,AIME 2024-25 original loop C3,0.0000,"[0.0, 0.0]",data/headline_turns.csv + scripts/reproduce_metrics.py,null result at the reported T2 endpoint
FRESH1,A/B/C*/D fixed-horizon endpoint counts,"A 36/36/36; B 53/35/40; C* 60/36/38; D 61/44/45","n=134",data/camera_ready_fresh_controls.csv,any-turn is oracle discovery; terminal and selector are realized endpoints
FRESH2,D versus repeated attempts on oracle discovery,+6.0 pp,"[+0.0,+11.9]; p=0.0963; Holm p=0.5775",data/camera_ready_fresh_paired_comparisons.csv,exploratory and not significant after multiplicity adjustment
FRESH3,D versus C* on oracle discovery,+0.8 pp,"[-6.7,+7.5]; p=1.0000",data/camera_ready_fresh_paired_comparisons.csv,no detected net oracle-coverage difference at n=134
FRESH4,D versus C* terminal,+6.0 pp,"[-1.5,+13.4]; p=0.1686",data/camera_ready_fresh_paired_comparisons.csv,higher terminal point estimate; paired evidence inconclusive
COST1,measured cost profile,"D total tokens 488512; B 246950; C* 464000","A/B/C*/D calls and tokens",data/camera_ready_cost_profile.csv,tokens are portable accounting; wall time is implementation-dependent
RET1,D discovery-retention gap,"61 any-turn; 44 terminal; 45 selector","selector gap 16",data/camera_ready_selector_retention.csv,correct answers are discovered more often than retained
CTRL1,F0-only rendering,+3.3 pp,"[-16.7,+23.3]",data/camera_ready_control_bounds.csv,initial field visibility alone is not enough
CTRL2,gold-substring guard,+23.3 pp,"[+10.0,+40.0]",data/camera_ready_control_bounds.csv,direct answer-string leakage is not driving the guarded lift
CTRL3,cross-item op shuffle,+10.0 pp,"[0.0,+23.3]",data/camera_ready_control_bounds.csv,structural scaffold contribution suggested but not isolated decisively
CTRL4,content-specific residual,+13.3 pp,"[-3.3,+30.0]",data/camera_ready_control_bounds.csv,content component unresolved at n=30
D1,easy paired directionality,+7.06 pp,"[+2.75,+12.16]",data/camera_ready_directionality_summary.csv,field rendering can help in easier paired regime
D2,hard paired directionality,-32.0 pp,"[-52.0,-11.0]",data/camera_ready_directionality_summary.csv,field rendering can hurt in hard paired regime
OP1,operation audit,"1119 proposed; 515 accepted; 604 rejected; 249 state-changing transitions","aggregate D arm",data/camera_ready_operation_audit.json,reducer checks local state validity rather than mathematical truth
ADAPT0,untrained 7B baseline,0.0667,6/90,data/adapter_evals.json,bounded hard-harness aggregate
ADAPT1,best trained 7B adapter,0.3667,33/90,data/adapter_evals.json,bounded hard-harness aggregate
ADAPT2,trained 3B adapter,0.3222,29/90,data/adapter_evals.json,bounded hard-harness aggregate
ADAPT3,comparison trained 7B adapter,0.3222,29/90,data/adapter_evals.json,bounded hard-harness aggregate
A1,typed-field training corpus size,5000 examples,n/a,data/artifact_footprint.json,footprint summary only; model weights excluded
A2,proof-construction corpus footprint,"200 trajectories; 50 theorem statements; 4 paths each",local symbolic checks,data/proof_corpus_summary.json + data/proof_corpus_index.jsonl,representational/local-check evidence only
A3,adapter hard harness footprint,30 problems x 3 attempts,90 attempts,data/adapter_evals.json,aggregate only; raw problem text and gold answers excluded
A4,OlympiadBench C3 effective denominator,29 effective items from 30 submitted,one execution/error exclusion,data/exclusions.csv,bounded to registered harness/source run
P1,related PRM diagnostic,"PRM CV AUC 0.518 vs structural-field CV AUC 0.945","598 examples; 107 unique problems; 5 folds",data/prm_field_diagnostic_summary.json,non-headline classifier diagnostic only
