# WS5 claim check (numeric, cross-domain)

This table is a **sanity check**: do the currently available paper-level runs and tables show improvements over baseline?
Numbers are taken from the per-run JSON tables under `artifacts/tables/` (see manifest).

| Domain/run | Baseline → Method | Success (Δ) | SLO violations (Δ) | Tokens after (Δ) | Latency P95, ms (Δ) | Wait, ms/episode (Δ) | Deadlock (Δ) | Notes |
|---|---|---:|---:|---:|---:|---:|---:|---|
| Domain A (Habitat, spnoise, 30ep) | `nobrace_noprune` → `brace_prune_r0.7` | 100.0% (0.0pp) | 4.7% (-80.8pp) | 20.02 (-215.05) | 2500 (-177.07) | - (-) | - (-) | SPL=0.994 (Δ=0.00) |
| Domain A (Habitat, oracle, 30ep) | `nobrace_noprune` → `brace_prune_r0.7` | 100.0% (0.0pp) | 1.0% (-83.7pp) | 20.00 (-210.00) | 2487 (-175.86) | - (-) | - (-) | SPL=0.997 (Δ=0.00) |
| Domain B (RoboFactory, real LLM, 10ep) | `nobrace_none` → `brace_erecap_r0.7` | 100.0% (0.0pp) | 50.0% (-50.0pp) | 318.73 (-1247.63) | 1213 (-391.16) | 3546 (-5517.68) | 0.0% (0.0pp) | - |
| Domain B (RoboFactory, proxy tokenizer, 10ep) | `nobrace_none` → `brace_erecap_r0.7` | 100.0% (30.0pp) | 5.7% (-25.9pp) | 153.05 (-53.80) | 254 (-76.72) | 6930 (-1852.97) | 0.0% (-30.0pp) | - |
| Domain C (AirSimNH, paper runs, 10ep) | `baseline` → `brace_full` | 100.0% (0.0pp) | - (-) | 800.00 (-2224.75) | 1640 (-7360.00) | 0 (0.00) | 0.0% (0.0pp) | near_miss=24.9 (Δ=-4.40); min_dist=1.974 (Δ=0.09) |
| Domain B (OpenVLA executor, paper run, 10ep; currently failing) | `baseline_nobrace_recency__int10__B450` → `brace_erecap_r0.7__int10__B450` | 0.0% (0.0pp) | 0.0% (0.0pp) | 152.68 (-47.54) | 1 (0.09) | 19 (2.38) | 0.0% (0.0pp) | - |
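
Each cell appears to report the method's value followed by the change against baseline in parentheses (Δ = method − baseline, in percentage points for rates). That convention is inferred from the rows above, not stated in the source tables. Under that assumption, the baseline value can be recovered as value − Δ. A minimal Python sketch, where `parse_cell` and `baseline_of` are hypothetical helper names and not part of the repo:

```python
import re
from typing import Optional, Tuple

# Matches cells such as "4.7% (-80.8pp)", "20.02 (-215.05)", or "- (-)".
_CELL = re.compile(
    r"^\s*([-+]?\d+(?:\.\d+)?|-)%?\s*"        # method value, or "-" if missing
    r"\(\s*([-+]?\d+(?:\.\d+)?|-)(?:pp)?\s*\)\s*$"  # delta, or "-" if missing
)

def parse_cell(cell: str) -> Tuple[Optional[float], Optional[float]]:
    """Split a 'value (Δ)' cell into (value, delta); '-' becomes None."""
    m = _CELL.match(cell)
    if m is None:
        raise ValueError(f"unrecognized cell format: {cell!r}")
    value, delta = (None if g == "-" else float(g) for g in m.groups())
    return value, delta

def baseline_of(cell: str) -> Optional[float]:
    """Recover the baseline, ASSUMING delta = method - baseline."""
    value, delta = parse_cell(cell)
    if value is None or delta is None:
        return None
    return round(value - delta, 2)
```

For example, `baseline_of("4.7% (-80.8pp)")` returns `85.5`, which would correspond to an 85.5% baseline SLO-violation rate if the assumed convention holds. Cells written as `- (-)` return `None`.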

## Interpretation notes (WS5)

- A domain can look “not better” in *Success* when the baseline already saturates it (e.g., oracle executors at 100%), so a 0.0pp delta there is not evidence against the method.
- When attributing differences between BRACE and non-BRACE runs, prefer multi-agent domains (Domain B/C) and the stability fields (deadlock/wait/churn).
- Budget-matched baselines (e.g., recency under a fixed token budget) can be strong, so the paper claim should emphasize tail latency, SLO violations, and stability under context growth, not only point success in easy regimes.
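
The saturation point in the first note can be checked mechanically: when the baseline is already at the ceiling, a zero Success delta carries no information about the method. A minimal sketch, assuming success is given in percent and Δ = method − baseline. The function name and the 99% ceiling are illustrative choices, not values from the source:

```python
def success_is_informative(method_pct: float, delta_pp: float,
                           ceiling_pct: float = 99.0) -> bool:
    """Return False when the baseline is already saturated at the ceiling.

    ASSUMPTION: delta_pp = method_pct - baseline_pct (percentage points).
    """
    baseline_pct = method_pct - delta_pp
    return baseline_pct < ceiling_pct
```

With the table's values, `success_is_informative(100.0, 0.0)` is `False` for the saturated Habitat rows, while `success_is_informative(100.0, 30.0)` is `True` for the RoboFactory proxy-tokenizer row.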

