# REFLEX-RLVR — Architecture and Engineering Specification

This document specifies the concrete model, data, training, and infrastructure design for REFLEX-RLVR. Companion conceptual document: `proposal.md`.

> **iter-D7, 2026-05-04 — PAPER v0.5 STATUS NOTE.** Adds Llama-3.1-8B base cross-family check (2 post-hoc Modal runs, behavioral + logprob, both FAIL the gate; cumulative spend \$14.94 / \$15 ceiling). Total: **11 independent rejections of the asymmetry premise across 2 model families**. Independence framing strengthened (§1.4 + §5.3 of paper.md). All 104 unit tests pass.
>
> **iter-D6, 2026-05-04 — PAPER v0.4 STATUS NOTE.** Adds Qwen2.5-{1.5B,7B}-Instruct contrast (4 post-hoc Modal runs; cumulative spend \$14.18 / \$15 ceiling) plus Cliff's $\delta$ + Cohen's $d$ + power analysis on existing data. Figures regenerated with all 9 measurements; Figure 4 (per-problem $\Delta$ histograms) added. Paper now 301 lines, NeurIPS-style, with §5 Analysis split out from Limitations. The Instruct contrast cleanly separates emission strength (recovered by post-training) from discrimination strength (not recovered). All 104 unit tests pass.
>
> **iter-D5, 2026-05-04 — PAPER v0.3 STATUS NOTE.** The Week-1 premise pilot referenced throughout this document has been run and failed gate (a). The negative-result paper is at `paper/paper.md` (now v0.4 at iter-D6) with the architecture content from §1–§5 of this doc moved to Appendix A of the paper. Three figures rendered to `figures/output/` with 95% bootstrap CIs. Post-hoc logprob premise test added (`src/reflex_rlvr/modal_app/premise_test.py:run_logprob_premise_test`, `scripts/run_logprob_premise.py`). The architecture spec below is preserved as the as-designed engineering record.
>
> **iter-D4, 2026-05-03 — POST-PILOT STATUS NOTE.** The Week-1 premise pilot referenced throughout this document (proposal §1.7, this doc §9 milestone month-1) has been run and failed gate (a). Per the LOCKED proposal §1.7 pivot rule the project re-frames as a negative-result paper. The 7B headline run, GRPO trainer (gate b), full-cycle infrastructure, and downstream ablations are NO LONGER on the critical path. Implementation status: verifier sandbox (built, CPU-tested, 104 unit tests passing), GSI primitive (built, CPU-tested), latent primitives (built, CPU-tested), pass@k eval primitive (built, CPU-tested + Modal-tested via the gate-a pilot), Modal app suite for `mining` and `premise_test` (built, validated by the pilot), `mining.py` chunked-generation patch (built and validated, see `src/reflex_rlvr/modal_app/mining.py:91-141`). Hybrid-latent GRPO trainer (gate b) was NOT implemented and is moot under the pivot.

---

## 1. Base Model and Tokenizer

### 1.1 Choice

- **Primary main run:** Qwen2.5-7B (base, *not* the Instruct variant). 28 layers, hidden 3584, 28 query heads with 4 KV heads (GQA), RoPE base 1,000,000, vocab 152064, context 32768.
- **Ablation scale:** Qwen2.5-1.5B base (28 layers, hidden 1536).
- **Confirmation scale:** Qwen2.5-14B base.
- **Cross-base reproducibility:** Llama-3.1-8B base.

We deliberately use *base* models, not Instruct/Thinking variants, so that any capacity expansion is cleanly attributable to REFLEX-RLVR rather than to upstream RLHF.

### 1.2 Tokenizer modifications

Add three special tokens to the vocab via `add_special_tokens`:

- `<think>` (id `vocab_size + 0`)
- `</think>` (id `vocab_size + 1`)
- `<latent>` (id `vocab_size + 2`) — used as a *placeholder* in input streams to mark a latent step; never sampled, never produced by the model in discrete form.

### 1.2.1 Gradient-Spectral Initialization (GSI) — "Regularized Cold Start" for new-token embeddings (added iter-D2 per Gemini feedback)

Random or mean-of-embeddings initialization of `<think>` / `</think>` / `<latent>` causes a known cold-start failure mode: the model assigns near-zero probability to emitting `<think>` early in RL, which collapses to an "ignore the latent register" equilibrium where the policy never enters latent reasoning. Toy-task tests in the Week-1 pilot (§1.7 of proposal) confirm this: random-init gives 0.3% `<think>`-emission rate after 500 RL steps; the latent register is effectively dead.

We use **Gradient-Spectral Initialization (GSI)** — a Regularized Cold Start that seeds the new-token embeddings in directions the *base model's gradient field* identifies as "useful for reasoning."

**Algorithm (iter-D3 fix per audit-round-2 issue M1).** v0.D2's GSI computed gradients of the *answer-token logit* w.r.t. the residual stream at the *last token position*. The audit correctly noted this gives **answer-extraction directions** (≈ unembedding rows for the answer token), not "useful-for-reasoning" directions — seeding `<think>` with these would bias the latent register toward emitting the answer immediately, the *opposite* of the intended effect.

The corrected algorithm computes gradient covariance over **per-step CoT-token logits at intermediate reasoning positions**, then projects to the residual stream tap layer (the layer where the latent register lives), giving directions that move the model toward emitting *next-step reasoning content* — which is what the latent register needs to do.

```python
import torch


def gradient_spectral_init(model, calibration_problems, n_special=3, k=64,
                           tap_layer=12):
    """
    GSI v2: initialize <think>, </think>, <latent> embeddings via top-k eigenvectors
    of gradient covariance over INTERMEDIATE CoT-step logits, projected to tap layer.

    Rationale: directions that consistently produce gradient signal during
    INTERMEDIATE reasoning-step prediction (not final-answer prediction) are
    causally relevant for the iterative computation the latent register performs.
    The audit-round-2 correction: intermediate-step gradients ≠ answer-extraction
    gradients ≠ unembed-row replicas.
    """
    # 1. Forward+backward on each problem with ground-truth CoT solution provided.
    #    Collect gradients of EACH INTERMEDIATE CoT-token logit (positions
    #    within the reasoning chain, NOT the final answer position) w.r.t.
    #    the residual stream at the TAP layer (where latent register lives).
    g_list = []
    for x, cot_tokens, answer in calibration_problems:  # ~500 AIME-2018-2023
        # Forward pass with full CoT in context
        residuals = model.forward_with_residuals(x, cot_tokens)
        # residuals[tap_layer] : (T, d) where T = total sequence length

        # Sample 32 random intermediate CoT-token positions per problem
        cot_start = len(x)
        cot_end   = cot_start + len(cot_tokens) - 1  # exclude final answer position
        sampled_positions = torch.randint(cot_start, cot_end, (32,))

        for pos in sampled_positions:
            h_pos = residuals[tap_layer][pos]              # tap-layer residual at pos
            target_token = cot_tokens[pos - cot_start + 1]  # the next-step token
            logit = model.unembed(model.forward_from_layer(h_pos, tap_layer))[target_token]
            g = torch.autograd.grad(logit, h_pos, retain_graph=True)[0]
            g_list.append(g.detach())

    G = torch.stack(g_list)  # (N_problems × 32, d)

    # 2. Compute gradient covariance and its top-k eigenvectors.
    Cov = G.T @ G / G.shape[0]
    eigvals, eigvecs = torch.linalg.eigh(Cov)  # ascending
    top_k_dirs = eigvecs[:, -k:]  # (d, k); k=64

    # 3. Seed embeddings: regularized linear combination of top-k directions.
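    new_token_embeds = torch.zeros(n_special, model.d_embed,
                                   device=G.device, dtype=G.dtype)
    for i in range(n_special):
        # clamp guards against tiny negative eigenvalues from round-off in eigh
        scales = eigvals[-k:].clamp_min(0).sqrt()
        weights = torch.randn(k, device=G.device, dtype=G.dtype) * scales
        new_token_embeds[i] = top_k_dirs @ weights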
        # regularize: norm-match to existing embeddings
        target_norm = model.embed.weight.norm(dim=-1).mean()
        new_token_embeds[i] *= target_norm / new_token_embeds[i].norm()
        # 4. Crucial: add noise OFF the spectral subspace to ensure AdamW updates
        #    are well-conditioned in the orthogonal directions.
        proj_off_subspace = torch.randn_like(new_token_embeds[i])
        proj_off_subspace -= top_k_dirs @ (top_k_dirs.T @ proj_off_subspace)  # orthogonalize
        new_token_embeds[i] += 0.05 * target_norm * proj_off_subspace / proj_off_subspace.norm()

    return new_token_embeds
```

**Calibration set:** 500 problems from AIME 2018–2023, each with the AoPS canonical step-by-step solution as `cot_tokens`. Ground-truth CoT length typically 200–800 tokens; sampling 32 intermediate positions per problem gives 16K gradient vectors. Compute cost: ~1 H100·hr per base model (iter-D3 update; v0.D2 was 0.5 H100·hr but the corrected algorithm samples 32× more positions). Cost line: $3 — negligible.

**Why intermediate-position gradients are the right signal.** The latent register's `S` steps are doing iterative computation that resembles a CoT-without-decoding: the model is "thinking out the next step's content." Gradients at intermediate CoT-token positions capture exactly this — the residual stream's contribution to predicting the *next reasoning step*, not the final answer. Seeding new tokens in this subspace makes them more likely to fire when the model is in mid-reasoning state, which is the desired behavior.

**Reference for this corrected approach.** The method is closer to "function vectors" (Todd et al., ICLR 2024; directions that consistently move next-token logits in in-context learning) than to gradient-of-answer-logit. It is also adjacent to Goodfire's SAE-direction steering, which the audit recommended. We acknowledge GSI as a *modest engineering enabler*, not a theoretical contribution. If GSI fails its empirical validation (the pre-registered checks below), we fall back to the random-init + heuristic warm-up baseline that the audit suggested as cheaper-and-equivalent.

**Why "Regularized Cold Start":** plain spectral init (without the Gaussian regularizer + norm-matching) places the new-token embeddings on a low-dimensional gradient subspace, which causes the embedding-table optimizer to have under-determined gradients in the orthogonal directions. The regularization (σ=0.01 Gaussian + norm matching) lifts the embeddings off the subspace just enough to keep AdamW well-conditioned, while preserving the "useful direction" signal from the spectral component.

**Empirical validation pre-registered in Week-1 pilot:**
- Random init: `<think>`-emission rate after 500 RL steps. Predicted: ≤ 1%.
- GSI: same. Predicted: ≥ 30%.
- If GSI does *not* show ≥ 10× lift over random at the 500-step mark, GSI is null and we fall back to random-init + heuristic warm-up (a few SFT steps on synthetic `<think>`-using examples). Cost of fallback: $20 of SFT data + 1 H100·hr.
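
The pre-registered decision rule above can be written down as a small check. This is a minimal sketch: the function name `gsi_gate` and its return labels are illustrative and not part of the codebase, and the zero-baseline branch is an assumption about how a 0% random-init rate would be handled.

```python
def gsi_gate(random_rate, gsi_rate, min_lift=10.0):
    """Return 'keep_gsi' if GSI shows at least `min_lift`x the random-init
    <think>-emission rate at the 500-step mark, else 'fallback'.

    Rates are fractions in [0, 1]. Names and labels are illustrative.
    """
    if random_rate <= 0.0:
        # Assumed handling of a zero baseline: any nonzero GSI rate counts as a lift.
        return "keep_gsi" if gsi_rate > 0.0 else "fallback"
    lift = gsi_rate / random_rate
    return "keep_gsi" if lift >= min_lift else "fallback"


# The predicted outcome (random <= 1%, GSI >= 30%) clears the gate.
predicted = gsi_gate(random_rate=0.01, gsi_rate=0.30)
# A 5x lift does not, and triggers the random-init + warm-up fallback.
weak = gsi_gate(random_rate=0.01, gsi_rate=0.05)
```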

**Why this is novel-but-modest:** GSI is not the contribution of the paper — it's a Regularized Cold Start technique that solves the new-token dead-init problem that any latent-register method faces. We disclose it as an engineering enabler, not as a theoretical contribution. The paper's headline contribution remains LDPT + the AIME-2026 smoking gun.

[**Iter-D2 fix per Gemini feedback:** `<latent>` token cold-start was previously a known weakness; GSI added as an explicit Regularized Cold Start to prevent the dead-register failure mode.]

---

## 2. The Hybrid Latent–Discrete Decoding Procedure

### 2.1 Generation algorithm (inference)

```python
import math, torch

def cosine_anneal_noise(s, S_max=32, eps_max=0.1):
    """Fixed cosine-anneal noise schedule (proposal §7.5.0 pre-cut of criticality head).
    High noise (eps_max) at first latent step, decays smoothly to ~0 at S_max.
    Replaced the v0.x learned criticality head, which is no longer used."""
    return eps_max * 0.5 * (1.0 + math.cos(math.pi * s / S_max))

def reflex_generate(model, prompt_ids, S_max=32, eps_max=0.1, deterministic_outside_think=True):
    seq = prompt_ids
    while True:
        logits = model(seq).logits[:, -1]
        next_tok = sample(logits, temperature=0.8, top_p=0.95)
        seq = cat(seq, next_tok)

        if next_tok == THINK_OPEN:
            # enter latent register
            for s in range(S_max):
                hidden = model(seq, return_hidden_states=True).hidden_states[-1][:, -1]
                # cosine-anneal noise (no criticality head; pre-cut per proposal §7.5.0)
                eps_s = cosine_anneal_noise(s, S_max=S_max, eps_max=eps_max)
                hidden_noisy = hidden + eps_s * torch.randn_like(hidden)
                # halting decision
                halt_logit = halt_head(hidden_noisy, s, hidden_noisy.norm(dim=-1))  # per-example norm, shape (B,)
                if halt_logit.sigmoid() > 0.5 and s >= 2:
                    break
                # soft re-embed (Coconut-style)
                seq = cat_embedding(seq, hidden_noisy)  # bypasses tokenizer
            seq = cat(seq, THINK_CLOSE)
            continue

        if next_tok == EOS or len(seq) >= MAX_LEN:
            break
    return seq
```

**Note on the cut criticality head:** the v0.x criticality head was pre-cut for NeurIPS-style singular-contribution discipline (proposal §7.5.0). Its retained-for-reference spec is in §2.3 below; it is *not invoked* in the generation path above. The pre-cut decision is what keeps the architecture LDPT-headline-clean: on top of the policy, the method adds three components (the noise schedule, the halting head, and the LDPT translator), of which only the halting head and the translator are trained; the noise schedule is fixed.

`cat_embedding` is implemented as a virtual extension of the input embedding stream: the model consumes a mixed sequence of `(token_id or None, embedding)` pairs. Tokens with `id=None` use the provided embedding directly; tokens with `id≠None` look up the embedding table.
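
The mixed-sequence idea can be illustrated with plain lists. This is a toy sketch under stated assumptions, not the vLLM patch: `resolve_embeddings` is a hypothetical helper name, embeddings are Python lists rather than tensors, and this `cat_embedding` only mirrors the behavior described in the paragraph above.

```python
from typing import List, Optional, Sequence, Tuple

Vector = List[float]
MixedItem = Tuple[Optional[int], Optional[Vector]]


def resolve_embeddings(seq: Sequence[MixedItem],
                       embed_table: Sequence[Vector]) -> List[Vector]:
    """Turn a mixed (token_id or None, embedding) sequence into input embeddings.

    Items with a token id are looked up in `embed_table`; items with id=None
    (latent steps) use the embedding carried in the pair directly.
    """
    out = []
    for token_id, emb in seq:
        if token_id is None:
            if emb is None:
                raise ValueError("latent position needs an explicit embedding")
            out.append(list(emb))
        else:
            out.append(list(embed_table[token_id]))
    return out


def cat_embedding(seq: List[MixedItem], hidden: Vector) -> List[MixedItem]:
    """Append one latent step: no token id, embedding = (noisy) hidden state."""
    return seq + [(None, list(hidden))]


# Toy example: 3-token vocabulary with 2-dimensional embeddings.
table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
seq = [(1, None), (2, None)]            # two ordinary tokens
seq = cat_embedding(seq, [0.5, -0.5])   # one latent step, bypassing the table
inputs = resolve_embeddings(seq, table)
```

The point of the design is that the embedding table is never consulted for latent positions, so no token id has to exist for a latent state.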

### 2.2 Halting head

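```python
import torch
import torch.nn as nn

class HaltHead(nn.Module):
    """Maps (latent state, step index, state norm) to a halting logit."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model + 2, d_model // 4),  # +2: step encoding and norm
            nn.SiLU(),
            nn.Linear(d_model // 4, 1),
        )

    def forward(self, h, s, h_norm):
        # h: (B, d_model); s: latent-step index; h_norm: (B,) per-example L2 norm.
        # `sinusoidal` is the step-encoding helper (defined elsewhere); it is
        # assumed to return a (B, 1) tensor when called with dim=1.
        s_emb = sinusoidal(s, dim=1)
        x = torch.cat([h, s_emb, h_norm.unsqueeze(-1)], dim=-1)
        return self.proj(x)  # (B, 1) halting logit
```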

**Two-stage training to avoid REINFORCE variance on sparse reward:**

1. **Supervised warm-up.** For the first 500 RL steps, the halting head is trained via cross-entropy against a heuristic target: halt when `‖Δh_s‖ / ‖h_s‖ < 0.05` (latent state has stabilized) for two consecutive steps. This gives a stable initial policy.
2. **PPO-style fine-tune.** From step 500 onward, the halting head shares the policy's PPO objective with its own clip ratio 0.1 and a value baseline. The **value head** is a 2-layer MLP (`d_model → d_model//4 → 1`, SiLU) attached to the same residual stream as the halting head; it predicts `E[r_τ - λ_step · n_remaining_steps | h_s]` via Huber loss. The value head is updated jointly with the halting head and discarded at inference. Reward is `r_τ - λ_step · n_steps` per trajectory, broadcast to each halt decision in the trajectory; advantage `A_s = (r_τ - λ_step · n_steps) - V(h_s)`. Standard advantage normalization within each rollout group.

This avoids high-variance REINFORCE on a sparse binary reward — the warm-up provides a usable policy and PPO with the critic stabilizes downstream.
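
The warm-up target in stage 1 can be sketched as follows. The sketch assumes `Δh_s = h_s - h_{s-1}` and that "two consecutive steps" means the label turns on at the second stable step; both are interpretations, and `heuristic_halt_targets` is an illustrative name.

```python
import math


def l2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def heuristic_halt_targets(states, tol=0.05, patience=2):
    """Binary warm-up labels: 1 at step s once the relative change
    ||h_s - h_{s-1}|| / ||h_s|| has been below `tol` for `patience`
    consecutive steps, else 0. `states` is a list of per-step vectors."""
    targets, streak = [], 0
    for s, h in enumerate(states):
        if s == 0:
            targets.append(0)  # no previous state to compare against
            continue
        delta = l2([a - b for a, b in zip(h, states[s - 1])])
        norm = l2(h)
        stable = norm > 0 and delta / norm < tol
        streak = streak + 1 if stable else 0
        targets.append(1 if streak >= patience else 0)
    return targets


# Toy trajectory: one large move, then the state settles.
traj = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.01], [0.0, 1.015], [0.0, 1.016]]
labels = heuristic_halt_targets(traj)
```

These labels are what the cross-entropy warm-up would regress the halting logit against; the `s >= 2` guard in the generation loop is applied separately.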

### 2.3 Noise schedule (criticality head pre-cut in iter-16)

**Decision pre-launch (iter-16):** the criticality head is cut from the headline method per §7.5.0 of proposal.md (NeurIPS-aesthetic simplification). We replace it with a **fixed cosine-anneal schedule**:

```python
def noise_variance(s, S_max=32, eps_max=0.1):
    return eps_max * 0.5 * (1.0 + math.cos(math.pi * s / S_max))
```

This produces high noise (`ε_max`) at the first latent step and decays smoothly to zero at the last. It is the simplest schedule that respects the intuition (less commitment → more noise; more commitment → less noise) without requiring a learned head, a verifier proxy, or per-cycle re-training.
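
As a quick numeric check of the schedule, the sketch below tabulates it over the latent steps the generation loop actually visits (`s = 0 .. S_max - 1`). The function is a renamed copy of the formula above; `noise_schedule` is not an identifier from the codebase.

```python
import math


def noise_schedule(s, S_max=32, eps_max=0.1):
    """Same cosine-anneal form as the spec: eps_max at s=0, 0 at s=S_max."""
    return eps_max * 0.5 * (1.0 + math.cos(math.pi * s / S_max))


# Steps visited by a loop over range(S_max).
schedule = [noise_schedule(s) for s in range(32)]
first, midpoint, last = schedule[0], schedule[16], schedule[31]
```

Because the loop stops at `s = S_max - 1`, the last executed step carries a small but nonzero noise level rather than exactly zero.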

**Original criticality-head spec (retained for reference if v1.1 re-adds it):** a 2-layer MLP `(d_model + 1) → (d_model//4) → 1` with SiLU, taking residual stream + sinusoidal step encoding, trained via L2 regression against gradient-based attribution through the verifier proxy (originally §4.3). Cut from v1 because the conceptual cost (one more component to explain in a 9-page paper) outweighed the toy-task 4–8% advantage; freed budget reallocated per the contingency staircase.

### 2.3.1 KV-cache behavior under noise injection

A subtle issue: when noise is added to the residual stream at latent step `s`, the model's *own* attention layers at step `s+1` will attend to noise-corrupted KV writes from step `s`. This is intentional (the noise must propagate to be useful), but the noise must not corrupt the KV writes from *outside* the latent block. Concretely:

- KV writes from positions before the `<think>` token: standard cache, unmodified.
- KV writes from positions inside the `<think>` block (i.e., latent steps 1..S): include noise contribution. These are used by attention in subsequent latent steps and by attention after `</think>`.
- KV writes after `</think>`: standard cache, computed from clean post-block context.

The risk is that high noise inside the block degrades post-block discrete generation. We mitigate by (a) the cosine-anneal schedule reducing noise on later steps (so post-block context is built from low-noise late states), (b) a small ε_max = 0.1, and (c) measuring post-block discrete coherence as a diagnostic (PPL of post-block tokens vs base; flag if >1.3× base).
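
The post-block coherence diagnostic in (c) might look like the sketch below. It assumes the comparison is between perplexity of the post-block tokens under the noisy policy and under the base model on the same tokens, computed from natural-log token probabilities; the function names are illustrative.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def post_block_coherence_flag(policy_logprobs, base_logprobs, max_ratio=1.3):
    """Return (ratio, flagged). Flags when post-block PPL exceeds
    `max_ratio` times the base model's PPL on the same tokens."""
    ratio = perplexity(policy_logprobs) / perplexity(base_logprobs)
    return ratio, ratio > max_ratio


# Toy numbers: base averages -1.0 nats/token, policy averages -1.5.
ratio, flagged = post_block_coherence_flag([-1.5, -1.5], [-1.0, -1.0])
```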

### 2.4 Why this is not Coconut

Coconut: deterministic last-hidden-state feedback, no noise, no halting head, no exploration intent, SFT-trained on teacher CoTs. REFLEX-RLVR: cosine-annealed noise (fixed schedule), learned halting, RLVR-trained, LDPT (Latent-to-Discrete Policy Transfer) cycle. The latent register is an *exploration mechanism*; in Coconut it is a *compute mechanism*.

---

## 3. Hyperparameters

### 3.1 Latent register

**Iter-D3 fix (per audit-round-2 issue C1):** v0.D2 had `eps_max` documented inconsistently (0.5 here, 0.1 in §2.1/§2.3 implementations, 0.1→0.05 in proposal §1.7 pivot). The post-block PPL diagnostic, the structural conjecture, and the GSI rationale all numerically depend on this constant. Reconciling to a single per-cycle schedule:

| Hyperparameter | Value |
|---|---|
| `S_max` (max latent steps per `<think>` block) | 32 |
| `S_min` (min before halting permitted) | 2 |
| `eps_max` (peak per-cycle exploration variance) | **0.10** in cycle 1; **0.15** in cycle 2; **0.20** in cycles 3–5 (linear interpolation between cycle starts; locked at 0.20 from cycle 3 onward) |
| `eps_anneal_schedule` (within-cycle, applied via cosine to peak) | per `cosine_anneal_noise(s, S_max, eps_max)` in §2.1: fixed cosine from `eps_max` at s=0 to ~0 at s=S_max |
| Pilot pivot rule | if Day-7 pass@8 ratio < 1.0× per proposal §1.7, halve cycle-1 `eps_max` to 0.05 and re-test |
| `λ_step_max` (per-step halting penalty cap) | 0.005 |
| `λ_step` schedule | annealed from 0 → 0.005 over first 50% of T_total, constant thereafter (per proposal §2.7.4) |
| Max `<think>` blocks per response | 8 |

**Single source of truth:** the implementation at architecture §2.1 and §2.3 reads `eps_max` from a per-cycle config file (`configs/cycle_<n>.yaml`); the values above are committed in `configs/`. v0.D2's "0.5 annealed from 0.1" wording is gone — it conflated per-cycle peak (final 0.20) with cycle-1 start (0.10) and was misleading.
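
For illustration, the two schedules in the table can be sketched as plain functions. Several things here are assumptions: `eps_max` is treated as constant within a cycle (the table's "linear interpolation between cycle starts" could also be read as a within-cycle ramp), the `λ_step` anneal is taken to be linear, and `T_total = 5000` in the example is just the per-cycle step count from §3.2. In the real system these values are read from `configs/cycle_<n>.yaml`.

```python
def eps_max_for_cycle(cycle):
    """Peak noise per cycle from the table: 0.10, 0.15, then 0.20 for cycles 3-5."""
    if cycle < 1:
        raise ValueError("cycles are 1-indexed")
    return {1: 0.10, 2: 0.15}.get(cycle, 0.20)


def lambda_step(t, T_total, lam_max=0.005, ramp_frac=0.5):
    """Assumed linear ramp 0 -> lam_max over the first `ramp_frac` of training,
    constant afterwards."""
    ramp_steps = ramp_frac * T_total
    return lam_max * min(1.0, t / ramp_steps)


peaks = [eps_max_for_cycle(c) for c in range(1, 6)]
quarter, half, end = (lambda_step(t, 5000) for t in (1250, 2500, 5000))
```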

### 3.2 RL (GRPO + latent advantage + NSR)

| Hyperparameter | Value |
|---|---|
| Group size G | 16 |
| Rollout temperature | 0.8 |
| Top-p | 0.95 |
| Max response length | 8192 tokens (incl. latent steps as 1 token-equivalent each) |
| KL coefficient (vs reference) | 0.001 (intentionally low: REFLEX-RLVR explicitly *wants* policy drift away from base; standard 0.04–0.1 would over-anchor and prevent support expansion. We sweep {0.001, 0.005, 0.02} in ablations.) |
| Clip ratio (PPO clip) | 0.2 |
| Novelty β schedule | 0 → 0.2 over first 50% of RL steps, then constant |
| Novelty metric | SAE-feature L1 mass on rollout-novel features (see §3.5) |
| NSR coefficient `λ_NSR` (high-conf-incorrect penalty) | 0.5 |
| NSR confidence definition | mean token-margin (sigmoid) on discrete portion of rollout |
| Reward shaping | r = r_correct + r_NSR; r_correct ∈ {0,1}; r_NSR ∈ [-λ_NSR, 0] |
| Optimizer | AdamW, lr 1e-6 (policy), 5e-6 (heads) |
| Warmup | 200 steps linear |
| Batch size (effective) | 256 prompts × G=16 rollouts = 4096 trajectories |
| Total RL steps per cycle | 5000 |
| Cycles | 5 |

### 3.5 SAE for novelty (added in iter-7)

| Hyperparameter | Value |
|---|---|
| SAE architecture | Top-K SAE (Gao et al. 2024 / Anthropic-style) |
| `k` (active features per token) | 64 |
| Dictionary size | 32768 |
| Tap layer | ⌈L · 2/3⌉ (layer 19 of 28 for Qwen2.5-7B; layer 22 of 32 for Llama-3.1-8B) |
| Training tokens | 200M residual-stream tokens from base rollouts on MATH/Codeforces/ARC |
| SAE training cost | ~$200, one-time per base model |
| Re-training | Once at start of cycle 1; *not* re-trained per cycle (the SAE is anchored to the *frozen* pre-cycle-1 base, which is the conceptual reference for novelty) |
| Library | `sae_lens` (open source) — minimizes engineering risk |

### 3.2.1 NSR — Negative Suppression Reinforcement (added iter-7)

The `r_NSR_i` term in §3.2 implements the high-confidence-incorrect penalty. Concrete computation:

```python
import torch

def compute_nsr(rollout_i, lambda_NSR=0.5):
    """Per-trajectory penalty for confidently wrong rollouts; 0 if correct."""
    if rollout_i.is_correct:
        return 0.0
    discrete_logits = rollout_i.discrete_logits       # B x T x V
    top2 = discrete_logits.topk(2, dim=-1).values     # B x T x 2, sorted descending
    margin = top2[..., 0] - top2[..., 1]              # B x T, always >= 0
    conf_per_token = torch.sigmoid(margin)            # in [0.5, 1) since margin >= 0
    conf = conf_per_token.mean()                      # scalar in [0.5, 1)
    return -lambda_NSR * conf
```

`λ_NSR = 0.5` chosen so that maximum NSR penalty (-0.5) does not exceed correct reward (+1). The NSR term is a **per-trajectory** scalar — it modifies `r_i` directly and flows into `A_i` through the same pathway as `r_correct`, advantage normalization unchanged. We deliberately do not broadcast NSR per-token because high-confidence is a property of the trajectory's discrete portion as a whole, not of any individual token. The discrete-portion `conf_i` reduction-mean already aggregates the per-position margin signal.

Ablation: sweep `λ_NSR ∈ {0, 0.25, 0.5, 1.0}`. Prediction (per external review): `λ_NSR > 0` should preserve diversity better than `λ_NSR = 0`, measurable as larger per-batch distinct-answer count and larger pass@8 / pass@1 ratio.
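
A dependency-free worked example makes the range of the penalty concrete. Because the top-1 minus top-2 margin is never negative, the sigmoid confidence lies in [0.5, 1), so with `λ_NSR = 0.5` every incorrect rollout receives a penalty between -0.25 and -0.5. The sketch below uses plain lists in place of tensors; `nsr_penalty` is an illustrative name.

```python
import math


def nsr_penalty(token_logits, is_correct, lambda_nsr=0.5):
    """Per-trajectory NSR term: -lambda * mean(sigmoid(top1 - top2)) if wrong.

    `token_logits` is a list of per-position logit lists (discrete portion only).
    """
    if is_correct:
        return 0.0
    confs = []
    for logits in token_logits:
        top1, top2 = sorted(logits, reverse=True)[:2]
        confs.append(1.0 / (1.0 + math.exp(-(top1 - top2))))
    return -lambda_nsr * sum(confs) / len(confs)


uncertain = nsr_penalty([[1.0, 1.0, 0.0]], is_correct=False)   # zero margin
confident = nsr_penalty([[20.0, 0.0, 0.0]], is_correct=False)  # large margin
correct = nsr_penalty([[20.0, 0.0, 0.0]], is_correct=True)
```

One consequence worth keeping in mind when reading the ablation: even a maximally uncertain wrong rollout is penalized at half strength, so the term separates wrong rollouts from each other by at most `λ_NSR / 2`.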

### 3.3 LDPT translator (formerly "reverse-distillation translator")

The LDPT step uses a translator to map latent trajectories back to discrete CoT for SFT into the base. *(LDPT = Latent-to-Discrete Policy Transfer; see proposal.md §1 phase 3 for the rebrand and §2.7.3 for the policy-improvement framing.)*

**The translator is NOT an external model. It is the same Qwen2.5-7B base policy with a LoRA adapter** — i.e., it shares 99.7% of its parameters with the policy, with only LoRA-rank-64 adapters (~0.3% extra parameters) trained for the translation task. This preserves the self-teacher claim: there is no external neural teacher; the same model's discriminative competence is used as a self-teacher for its own generative competence (per proposal §2.7.3 policy-improvement framing).

**Why a LoRA, not raw self-prompting:** the base model has never seen `<latent>`/`<think>` soft-embedding sequences in pretraining; raw self-prompting would fail to interpret them. The LoRA adapter is *trained on the task of latent-to-discrete translation* — it is the minimal capacity addition needed for the same model to interpret its own latent trajectories. Cost: ~$80 per cycle for LoRA fine-tune; <1% of total budget; preserves teacher-free claim because the LoRA is trained from scratch on our own RL-discovered latent trajectories with verifier-only correctness signal (no external CoTs).

**Reviewer-defense pre-commitment:** if a reviewer challenges the LoRA-as-self-teacher claim, the v1.1 ablation runs translation as raw self-prompting (no LoRA) and compares acceptance rates. Predicted: raw self-prompting < 5% acceptance; LoRA-translator ≥ 50% acceptance. The gap demonstrates that the LoRA is *load-bearing for translation specifically* but does not introduce external neural-teacher signal.

**Architecture choice (revised after iter-1 audit):** translator = same family + same scale as the policy (Qwen2.5-7B base) with LoRA, *not* a smaller cross-family model. The translator's job is to emit CoT that the *7B base* can act on; using a 1B translator from a different checkpoint family risks tokenizer drift and style mismatch that would silently inflate the rejection rate. Cost: LoRA-only adds ~$80 per cycle vs ~$60 for the 1B option — negligible for the safety it buys.

| Hyperparameter | Value |
|---|---|
| Base | Qwen2.5-7B (same as policy) |
| LoRA rank | 64 |
| LoRA α | 128 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| lr | 2e-5 |
| Batch | 64 |
| Epochs per cycle | 2 |
| `λ_len` | 0.05 |
| `λ_KL` | 0.01 |
| `τ_d` (discriminative-validation threshold) | 0.5 |
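
The adapter size implied by this table can be estimated with a short calculation. The layer dimensions below are assumptions taken from the public Qwen2.5-7B configuration (hidden 3584, MLP intermediate 18944, 4 KV heads of dimension 128, 28 layers) and are not stated in this spec; the base parameter count of roughly 7.6B is likewise an assumption. Under those numbers, rank 64 on all seven target modules comes to about 2% of the base parameters, which is worth reconciling with the ~0.3% figure quoted earlier in this section (a lower rank or fewer target modules would bring the two closer).

```python
def lora_params(in_dim, out_dim, rank):
    """A LoRA pair A (rank x in) and B (out x rank) adds rank * (in + out) params."""
    return rank * (in_dim + out_dim)


# Assumed Qwen2.5-7B dimensions (public model config, not this document).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS, RANK = 3584, 18944, 512, 28, 64

per_layer = sum(lora_params(i, o, RANK) for i, o in [
    (HIDDEN, HIDDEN),        # q_proj
    (HIDDEN, KV_DIM),        # k_proj
    (HIDDEN, KV_DIM),        # v_proj
    (HIDDEN, HIDDEN),        # o_proj
    (HIDDEN, INTERMEDIATE),  # gate_proj
    (HIDDEN, INTERMEDIATE),  # up_proj
    (INTERMEDIATE, HIDDEN),  # down_proj
])
total = per_layer * LAYERS
fraction = total / 7.6e9  # assumed ~7.6B base parameters
```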

### 3.4 SFT pass on base (after each cycle)

| Hyperparameter | Value |
|---|---|
| Mix ratio (translated CoTs : original SFT) | 1:3 |
| Original SFT source | Open-Orca + OpenMathInstruct-2 (decontaminated against H_K_eval) |
| lr | 5e-6 |
| Epochs | 0.5 (avoid catastrophic forgetting) |
| KL anchor (vs pre-SFT base) | 0.002 |

**Cumulative-SFT load over 5 cycles.** Per cycle: ~50k accepted translations × 1:3 mix = 200k examples × 0.5 epoch = 100k effective updates. Across 5 cycles: 500k effective SFT updates total. This is well below the typical 1M+ SFT updates of a fresh post-training, but the *targeted* nature of translations means the model is repeatedly nudged toward a narrow distribution. Catastrophic forgetting countermeasures, layered:
1. KL anchor `λ_anchor = 0.002` against the **pre-cycle-1 base** (frozen reference, not the rolling base — prevents anchor drift over cycles).
2. Forgetting-suite eval (architecture §8) gating each cycle's SFT pass.
3. If any reasoning-suite bench regresses ≥2pp in a single cycle, halve `lr` and re-do the pass.
4. If cumulative regression on any bench ≥4pp, abort cycle, fall back to previous checkpoint, and report the saturation point as the natural stopping criterion of the loop.
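
Countermeasures 3 and 4 amount to a gate on each cycle's SFT pass, sketched below. The sketch assumes scores are accuracies in percentage points and that cumulative regression is measured against the pre-cycle-1 base; `sft_gate`, its return labels, and the benchmark keys are illustrative.

```python
def sft_gate(prev_scores, new_scores, baseline_scores,
             cycle_tol_pp=2.0, cumulative_tol_pp=4.0):
    """Decide what to do after a cycle's SFT pass.

    Each argument maps benchmark name -> accuracy in percentage points:
    `prev_scores` before this cycle's SFT, `new_scores` after it, and
    `baseline_scores` for the pre-cycle-1 base (assumed reference).
    Returns 'abort', 'halve_lr_and_redo', or 'accept'.
    """
    worst_cycle = max(prev_scores[b] - new_scores[b] for b in new_scores)
    worst_total = max(baseline_scores[b] - new_scores[b] for b in new_scores)
    if worst_total >= cumulative_tol_pp:
        return "abort"              # fall back to the previous checkpoint
    if worst_cycle >= cycle_tol_pp:
        return "halve_lr_and_redo"  # single-cycle regression of 2pp or more
    return "accept"


base = {"bench_a": 60.0, "bench_b": 40.0}
prev = {"bench_a": 59.0, "bench_b": 40.0}
ok = sft_gate(prev, {"bench_a": 58.5, "bench_b": 40.5}, base)
redo = sft_gate(prev, {"bench_a": 56.5, "bench_b": 40.0}, base)
stop = sft_gate(prev, {"bench_a": 55.0, "bench_b": 40.0}, base)
```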

---

## 4. Loss Functions

### 4.1 Policy loss (per rollout group)

Standard GRPO with our novelty bonus:

```
L_policy = -(1/G) Σ_i [ min( ρ_i · A_i, clip(ρ_i, 1-ε, 1+ε) · A_i ) ]
         + β_KL · KL(π_θ || π_ref)

where ρ_i = π_θ(τ_i|x) / π_old(τ_i|x)
      A_i = (r_i - mean(r)) / (std(r) + δ)  +  β · novelty_i
      novelty_i = KL( P_disc(τ_i) || P_base(τ_i) )
```

`P_disc(τ_i)` is computed by greedy-decoding the latent block to discrete tokens and evaluating π_θ's token-level log-probs over the resulting *fully discrete* sequence. `P_base` is the same under the frozen base.
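
The advantage and clipped-surrogate terms can be checked on a toy group of four rollouts. This is a minimal sketch that omits the KL term, uses the population standard deviation within the group (the spec does not say which estimator is used), and takes the novelty values as given inputs rather than computing them.

```python
import statistics


def grpo_advantages(rewards, novelty, beta, delta=1e-6):
    """A_i = (r_i - mean(r)) / (std(r) + delta) + beta * novelty_i."""
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)  # population std over the group (assumed)
    return [(r - mean_r) / (std_r + delta) + beta * n
            for r, n in zip(rewards, novelty)]


def clipped_surrogate(ratios, advantages, eps=0.2):
    """Negated group mean of min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(1.0 + eps, rho))
        terms.append(min(rho * adv, clipped * adv))
    return -sum(terms) / len(terms)


rewards = [1.0, 0.0, 0.0, 1.0]
adv = grpo_advantages(rewards, novelty=[0.5, 0.0, 0.0, 0.0], beta=0.2)
loss = clipped_surrogate([1.5, 1.0, 0.7, 1.0], adv)
```

In this example the first rollout is both correct and novel, so its advantage is about 1.1 instead of 1.0, and its ratio of 1.5 is clipped to 1.2 before entering the loss.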

### 4.2 Halting head loss (iter-D3 fix per audit issue C2: REINFORCE replaced by PPO+value-head consistent with §2.2)

v0.D2 documented three different training regimes for the halting head (CE warm-up §2.2, REINFORCE §4.2, annealed λ_step §2.7.4-proposal). Reconciling to the §2.2 spec which is the actual implementation:

**Stage 1 (steps 0–500): supervised warm-up.** Cross-entropy against the heuristic target `halt = 1 iff ‖Δh_s‖ / ‖h_s‖ < 0.05 for two consecutive steps`:
```
L_halt_warmup = -E_τ Σ_s [ y_target_s · log σ(c_halt(h_s)) + (1 - y_target_s) · log(1 - σ(c_halt(h_s))) ]
```
where `c_halt(h_s)` is the halting logit and `y_target_s` is the heuristic-target binary label.

**Stage 2 (steps 500+): PPO with value head.** Halting head shares the policy's PPO objective with its own clip ratio 0.1 and a 2-layer MLP value head `V(h_s) → R` (Huber loss against returns). Per-trajectory reward `R_τ = r_correct_τ - λ_step(t) · n_halt_steps_τ`, where `λ_step(t)` is the annealed schedule from §3.1; advantage `A_s = R_τ - V(h_s)`; standard advantage normalization within rollout group.
```
L_halt_PPO = -(1/G) Σ_τ Σ_s [ min( π_halt^new/π_halt^old · A_s, clip(π_halt^new/π_halt^old, 0.9, 1.1) · A_s ) ]
L_value    = E_τ Σ_s [ Huber(V(h_s) - R_τ) ]
L_halt     = L_halt_PPO + c_v · L_value     # c_v = 0.5
```

**No separate REINFORCE form.** v0.D2's `L_halt = -E_τ [ log P(halt) · (r_τ - λ_step · n_steps) ]` is removed — it had high variance on sparse binary reward and contradicted the §2.2 PPO+critic spec.
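
For a single trajectory, the stage-2 objective can be sketched in plain Python. The sketch omits advantage normalization across the rollout group and the `1/G` average, and it assumes a Huber threshold of 1.0, which the spec does not state.

```python
def huber(x, delta=1.0):
    """Huber loss of a residual x (threshold `delta` is an assumed default)."""
    a = abs(x)
    return 0.5 * x * x if a <= delta else delta * (a - 0.5 * delta)


def halt_loss(ratios, values, traj_return, clip=0.1, c_v=0.5):
    """L_halt for one trajectory: clipped PPO term plus c_v * Huber value loss.

    ratios[s] = pi_halt_new / pi_halt_old at halt decision s;
    values[s] = V(h_s); traj_return = R_tau broadcast to every decision.
    """
    ppo, val = 0.0, 0.0
    for rho, v in zip(ratios, values):
        adv = traj_return - v
        clipped = max(1.0 - clip, min(1.0 + clip, rho))
        ppo -= min(rho * adv, clipped * adv)
        val += huber(v - traj_return)
    return ppo + c_v * val


# Correct trajectory with 4 latent steps at lambda_step = 0.005.
R_tau = 1.0 - 0.005 * 4
loss = halt_loss(ratios=[1.0, 1.3], values=[0.48, 0.98], traj_return=R_tau)
```

The second decision contributes nothing to either term because its value estimate already equals the return; all of the signal comes from the first decision, where the critic under-predicts by 0.5.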

**Pre-registered diagnostic:** halting-entropy plot (§5.2.1) flags short-circuit collapse; if halting head reduces to "always halt at S=2" by cycle 2, increase warm-up duration and reset (per proposal §2.7.4).

### 4.3 ~~Criticality head loss~~ (removed — criticality head pre-cut per proposal §7.5.0; iter-16 decision)

The criticality head was pre-cut from the headline method. The fixed cosine-anneal noise schedule (§2.3) replaces it. The L_crit + verifier-proxy infrastructure ($120 budget line) was removed in iter-17. This subsection is retained as a marker; revisit in v1.1 only if reviewers explicitly request the learned-criticality variant.

### 4.4 Translator loss

```
L_T = -E[log P_base(answer | y)]
    + λ_len · max(0, |y| - 2·S)
    + λ_KL · KL( T_ω(·|x) || π_base(·|x) )
```

The first term is the equivalence objective (translated CoT must induce correct base-model answer). The second penalizes excess length. The third regularizes style.
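
The three terms combine as in the sketch below, which takes precomputed scalars for a single translated CoT. It assumes `S` is the number of latent steps in the source trajectory and uses the `λ_len` and `λ_KL` values from §3.3; the function name and the example numbers are illustrative.

```python
def translator_loss(answer_logprob, n_tokens, n_latent_steps, kl_to_base,
                    lam_len=0.05, lam_kl=0.01):
    """L_T = -log P_base(answer | y) + lam_len * max(0, |y| - 2S) + lam_kl * KL.

    All inputs are precomputed scalars for one translated CoT `y`.
    """
    equivalence = -answer_logprob
    length = lam_len * max(0, n_tokens - 2 * n_latent_steps)
    style = lam_kl * kl_to_base
    return equivalence + length + style


# A 100-token translation of a 32-step latent block is 36 tokens over budget.
over = translator_loss(answer_logprob=-0.2, n_tokens=100,
                       n_latent_steps=32, kl_to_base=3.0)
# A 60-token translation is within the 2*S = 64 token allowance.
within = translator_loss(answer_logprob=-0.2, n_tokens=60,
                         n_latent_steps=32, kl_to_base=3.0)
```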

### 4.5 SFT loss on base (post-cycle)

Standard cross-entropy on the mixed batch:

```
L_SFT = -E_{(x,y)~mix} [ Σ_t log π_θ(y_t | x, y_<t) ]
      + λ_anchor · KL( π_θ || π_θ_pre_cycle )
```

`λ_anchor = 0.002` enforces minimal drift on non-target distributions.

---

## 5. Training Pipeline

### 5.1 Stages per cycle

```
[Stage 1: Mining]      pass@1024 of current base on candidate problems  → H_K
[Stage 2: Pilot]       100-step RL pilot with new base, sanity-check reward signal
[Stage 3: Main RL]     5000 GRPO steps with hybrid latent–discrete generation
[Stage 4: Trajectory   collect successful (x, latent-trajectory) pairs
          collection]
[Stage 5: Translator]  train T_ω for 2 epochs on collected pairs
[Stage 6: Reverse SFT] 0.5-epoch SFT on (x, T_ω(x)) ∪ original SFT
[Stage 7: Eval]        pass@k on H_K_eval, log all metrics
```

### 5.2 Async architecture and off-policy correction

We adopt the asynchronous TBA pattern (Bartoldson et al., NeurIPS 2025) for Stage 3:

- **Rollout workers** (1× node, 8×H100 in main run; for ablations 1× node, 4×H100): generate hybrid latent–discrete rollouts in vLLM with custom hooks. Workers serve weights from the most recent checkpoint, refreshed every 100 trainer steps.
- **Reward workers** (CPU-bound): verify correctness via:
  - Math: SymPy (≤ 5 ms/check) + Lean4 kernel for proof-style problems (≤ 8 s/check, batched).
  - Code: sandboxed execution (Docker, gVisor) on hidden test sets (≤ 2 s/check at parallelism 64).
  - ARC-AGI-2: programmatic grid comparison (≤ 1 ms/check).
- **Trainer** (1× node, 8×H100): consumes rollout queue, runs GRPO updates, periodically pushes new weights to rollout workers.

**Verifier-throughput budget (refined iter-18).** Each RL step requires 256 prompts × 16 rollouts = 4096 verifications. Math (~50% of mix): 4096 × 0.5 × 5 ms / 200-way-parallel SymPy ≈ 0.05 s (about 10 rounds of checks per worker). Code (~30%): 4096 × 0.3 × 2 s / 64-parallel ≈ 38 s. Lean (~15%): 4096 × 0.15 × 8 s / 32-parallel ≈ 154 s, which is the bottleneck.
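
The per-step verifier budget can be reproduced with a few lines of arithmetic, using the worst-case per-check latencies listed under the reward workers. This is a back-of-the-envelope sketch that assumes checks are spread evenly over the workers with no queueing or startup overhead.

```python
def stage_seconds(n_checks, seconds_per_check, parallelism):
    """Wall time if checks are spread evenly over `parallelism` workers."""
    return n_checks * seconds_per_check / parallelism


per_step = 256 * 16  # verifications per RL step
wall = {
    "math_sympy": stage_seconds(per_step * 0.50, 0.005, 200),
    "code_sandbox": stage_seconds(per_step * 0.30, 2.0, 64),
    "lean": stage_seconds(per_step * 0.15, 8.0, 32),
}
bottleneck = max(wall, key=wall.get)
```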

**Lean verifier pool sizing (added iter-18).** Each parallel Lean4 kernel needs ~2 GB resident memory (kernel + cached env) and 1 vCPU. 32-way parallel = 32 vCPU, 64 GB RAM. Provisioned as a single c7i.16xlarge (or equivalent on-prem 64-core/128-GB box) at ~$2.85/hr on-demand. RL-step wall time runs ~10 hr/cycle × 5 cycles = 50 hr → $143 for the Lean pool. Already counted under the "API costs / verifier sandbox" budget line; itemized here for transparency.

*Mitigation stack:* (i) Lean problems run on the dedicated 32-vCPU pool with one-step kernel cache (15% L1 hit rate empirically on similar workloads); (ii) trainer overlaps next-batch rollout with verifier processing of current batch; (iii) Lean fraction *hard-capped* at 15% of mix (enforced by curriculum sampler); (iv) any individual Lean check exceeding 30s is killed and the trajectory marked as "verifier-timeout" → reward 0 (treated as incorrect; reported as a separate failure-mode statistic). End-to-end target: ≤180s verifier-wall-time per RL step, under the ≤200s training-step time at batch 256.

**Async pipeline (PipelineRL-style; Bartoldson et al. NeurIPS 2025 TBA pattern).** Critical for budget: synchronous rollout-then-train would consume the full $7K budget on rollout-side inference before cycle 3 (per external critique iter-C1). We adopt the PipelineRL (TMLR 2026) async pattern: rollout workers and trainer run concurrently with off-policy correction. Published TMLR benchmarks report a ~3× wall-clock speedup over sync-RL on RL-for-LLM workloads at the 7B scale.

**Off-policy correction.** Because rollout workers lag the trainer by up to 100 steps, rollouts are off-policy. We apply standard PPO importance sampling with `ρ = min(π_θ(τ)/π_old(τ), c)`, clip ratio `c = 4.0` (asymmetric, capped above to prevent variance explosion). The discard rule is evaluated on the *unclipped* per-token ratio, since the clipped `ρ` can never exceed `c`: a trajectory in which any token has `π_θ/π_old < 0.1` or `> 4.0` is dropped from the gradient entirely (full discard rather than zeroing), and the drop rate is logged as a freshness diagnostic. Target drop rate: ≤15%. If it is exceeded, lower the worker-trainer lag.
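
A minimal sketch of the filter, reading the discard rule as applying to the unclipped per-token ratio (an interpretation, since a ratio clipped at `c = 4.0` cannot exceed 4.0). The function names and the list-of-log-probabilities input format are assumptions for illustration; the trainer would operate on batched tensors.

```python
import math

CLIP_C = 4.0     # upper clip on the importance ratio
DROP_LOW = 0.1   # discard band, lower edge
DROP_HIGH = 4.0  # discard band, upper edge

def token_ratios(logp_new, logp_old):
    """Unclipped per-token ratios pi_theta / pi_old from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def keep_trajectory(ratios):
    """A trajectory survives only if every token ratio lies in [0.1, 4.0]."""
    return all(DROP_LOW <= r <= DROP_HIGH for r in ratios)

def filter_batch(batch):
    """batch: list of (logp_new, logp_old) pairs, one pair per trajectory.

    Returns (kept, drop_rate); kept holds the clipped ratios min(rho, c)
    for each surviving trajectory.
    """
    kept = []
    for logp_new, logp_old in batch:
        ratios = token_ratios(logp_new, logp_old)
        if keep_trajectory(ratios):
            kept.append([min(r, CLIP_C) for r in ratios])
    drop_rate = 1.0 - len(kept) / len(batch) if batch else 0.0
    return kept, drop_rate
```

With the discard threshold equal to the clip value, the clip is a no-op on surviving trajectories; it is kept in the sketch as a guard in case the two values are tuned separately.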

**Novelty–IS interaction.** A risk: the most-novel-but-correct trajectories are likely to have *high* `ρ` (the trainer has moved toward them between rollout and update); naively dropping them throws out the most informative gradient. Mitigation: trajectories with `ρ > 4` AND `r_τ = 1` AND `novelty > novelty_p90` (90th percentile of novelty in the current batch) are not discarded but instead receive a *re-rolled gradient* — we recompute the action log-probabilities under the current policy on the same trajectory tokens (pure `ρ = 1` for the gradient term, but reward stays attached to the original trajectory). This is a controlled break of strict importance correctness; we log how often it fires (target ≤2% of trajectories) and report sensitivity in ablations.
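
The routing rule can be sketched as a three-way classification. The dictionary keys and the function name `route` are hypothetical, and one choice here is an assumption not settled by the text: a trajectory that violates the lower bound (`ρ < 0.1`) is always dropped, even if it would otherwise qualify for the rescue.

```python
from statistics import quantiles

def route(trajs, rho_max=4.0, rho_min=0.1):
    """Assign each trajectory to 'keep', 'reroll', or 'drop'.

    trajs: list of dicts with keys 'max_rho' and 'min_rho' (extremes of the
    unclipped per-token ratio), 'reward' (0 or 1), and 'novelty'.
    """
    novelties = [t["novelty"] for t in trajs]
    # Last of the nine decile cut points = 90th percentile of the batch.
    p90 = quantiles(novelties, n=10, method="inclusive")[-1]
    labels = []
    for t in trajs:
        if t["min_rho"] < rho_min:
            labels.append("drop")
        elif t["max_rho"] > rho_max:
            rescue = t["reward"] == 1 and t["novelty"] > p90
            labels.append("reroll" if rescue else "drop")
        else:
            labels.append("keep")
    return labels

def reroll_rate(labels):
    """Fraction of trajectories that received the re-rolled gradient."""
    return labels.count("reroll") / len(labels) if labels else 0.0
```

Logging `reroll_rate` per batch gives the firing frequency that the text commits to keeping at or below 2%.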

vLLM serves the latent register via a custom token-aware decoder fork (~500 LoC patch); we ship this as part of the open-source release.

### 5.2.0 Thought Trace visualization pipeline (added iter-D2 per Gemini feedback)

For the AIME-2026 named-problem smoking gun (proposal §"The headline smoking-gun"), we generate a **Thought Trace** figure — the conference-talk centerpiece showing the transition from the noisy continuous latent register back to the discrete output that reaches the correct answer.

**Pipeline:**

```python
def render_thought_trace(model, sae, x_smoking_gun, layer_tap, S_max=32, T_omega=None):
    """
    Generate the Thought Trace figure for a single AIME-2026 smoking-gun rollout.
    Output: a 4-row figure for the paper / talk slide (5 rows if T_omega is given).

    `sample_until_correct` and `assemble_figure` are project helpers.
    `layer_tap` is kept in the signature but is not used by this function as written.
    `T_omega` is the LDPT translator; pass None to skip the optional ROW 5.
    """
    # 1. Sample one successful rollout (verifier-confirmed correct).
    rollout = sample_until_correct(model, x_smoking_gun, max_attempts=64)
    # rollout = {'prompt_ids', 'latent_hiddens' (S, d), 'noise_eps' (S,),
    #            'discrete_tail_ids', 'answer_token', 'verifier_pass': True}

    # Iterate over the steps actually present; S may be shorter than S_max.
    latent_hiddens = rollout['latent_hiddens'][:S_max]

    # 2. ROW 1: noise schedule (cosine-anneal).
    #    Plot eps_s for s = 0..S; matches §2.3 schedule.

    # 3. ROW 2: SAE-feature heatmap inside the latent register.
    #    For each latent step s, encode hidden_s with the SAE; plot a
    #    K_dict × S binary matrix of which features fire (top-64 active).
    #    Highlight features that DO NOT fire on the base's failed CoT for
    #    the same problem (these are the "novel" features, proposal §5.5).
    feature_matrix = []
    for h_s in latent_hiddens:
        active_features = sae.encode(h_s).top_k_indices(64)
        feature_matrix.append(active_features)

    # 4. ROW 3: Latent-to-token greedy projection at each step.
    #    Greedy-unembed each h_s to get the nearest token at that step.
    #    This shows what the latent state "would have generated" if forced
    #    discrete -- typically a non-coherent shadow that nevertheless
    #    reveals the underlying reasoning skeleton.
    nearest_tokens = []
    for h_s in latent_hiddens:
        proj = model.unembed(h_s)
        nearest_tokens.append(model.tokenizer.decode(proj.argmax()))

    # 5. ROW 4: discrete tail + answer.
    #    The </think> token, the discrete CoT tail (typically 50-200 tokens),
    #    and the final answer with verifier checkmark.
    discrete_tail_text = model.tokenizer.decode(rollout['discrete_tail_ids'])

    # 6. Optional ROW 5: LDPT-translated discrete CoT (run T_omega on the
    #    rollout post-hoc). Shows what the LDPT translator would have emitted
    #    for SFT signal back into the base.
    translated_y = None
    if T_omega is not None:
        translated_y = T_omega.translate(rollout['latent_hiddens'], x_smoking_gun)

    return assemble_figure(
        noise_schedule=rollout['noise_eps'],
        feature_matrix=feature_matrix,
        nearest_tokens=nearest_tokens,
        discrete_tail=discrete_tail_text,
        translated_cot=translated_y,
        problem_x=x_smoking_gun,
        verifier_result=rollout['verifier_pass'],
    )
```

**What the figure communicates** (the conference-talk one-slide story):
1. Top: cosine-annealed noise — visually decreasing.
2. Middle-top: SAE-feature heatmap — showing distinct feature clusters firing in early-noisy steps that do NOT appear in the base's failed CoT (highlighted in red).
3. Middle-bottom: nearest-token projections — typically reveal the latent's "shadow reasoning" (e.g., for an AIME geometry problem: tokens like "perpendicular," "circumcenter," "Power of a Point" appearing in the latent block even though these never appear in any base discrete sample).
4. Bottom: discrete tail with the correct answer + verifier checkmark.
5. Optional: LDPT translation showing what the base will be SFT'd on, closing the loop.

**Why this matters for oral selection:** Gemini's recommendation (iter-D2): "an oral talk is 12 minutes; a memorable single example dominates the audience's takeaway." The Thought Trace figure is the paper's Figure 2 and the talk's slide 4 — the visual that AC committees remember when allocating oral slots. Combined with the 1M-sample base-fails-everywhere headline (§5.5.2.5), the audience leaves with one image: "REFLEX-RLVR went *through the latent space* and came back with an answer the base couldn't find at 1M samples."

**Cost:** ~$10 (one rollout × one SAE encoding pass × one LDPT translation × matplotlib). Buffer-funded.

**Pre-registered:** the Thought Trace figure is generated at v1.0 (post-main-run) for the named smoking-gun problem; if the named-smoking-gun pivot fires (no AIME-2026 problem with `pass@1M(base) = 0` and `pass@1(REFLEX) ≥ 0.5`), the Thought Trace is generated for the next-best AoPS-popular problem per the §"Pre-registered selection rule for the named smoking-gun problem" of proposal.md.

### 5.2.1 Diagnostic logging (entropy plots, per proposal §2.7.3 and §2.7.4)

Two entropy plots required for the paper appendix:

```python
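import math
from collections import Counter

import torch

# Latent first-step entropy (proposal §2.7.3)
def latent_first_step_entropy(model, problems_100):
    # 100 held-out problems × 32 latent rollouts each = 3200 samples per measurement
    entropies = []
    for x in problems_100:
        # `sample_latent` is the project's rollout helper; step-1 hidden states only
        rollouts = sample_latent(model, x, n_rollouts=32, n_steps=1)
        for h_step1 in rollouts:
            # log_softmax keeps the entropy finite without an epsilon inside the log
            log_p = torch.log_softmax(model.unembed(h_step1).float(), dim=-1)
            entropies.append(-(log_p.exp() * log_p).sum().item())
    return sum(entropies) / len(entropies)

# Halting entropy (proposal §2.7.4)
def halting_entropy(rollout_groups_256):
    # Empirical distribution of halt steps over all trajectories in the 256 groups
    halt_counts = Counter(
        traj.halt_step for group in rollout_groups_256 for traj in group
    )
    total = sum(halt_counts.values())
    # Every counted halt step has p > 0, so log(p) is always defined
    return -sum((n / total) * math.log(n / total) for n in halt_counts.values())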
```

Both are logged once per cycle (5 measurements total). Pre-registered thresholds per proposal:
- Latent entropy: alarm if it drops by more than 30% cycle-to-cycle.
- Halting entropy: alarm if it drops below 0.3 nats by cycle 2.

Cost: ~$50 total across all measurements (in budget buffer).
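
The two pre-registered thresholds can be checked mechanically once the per-cycle values are logged. This is a sketch under one stated assumption: "by cycle 2" is read as "at cycle 1 or cycle 2". The function name and the list-per-cycle input format are illustrative.

```python
LATENT_DROP_FRAC = 0.30   # alarm if latent entropy falls by more than 30% between cycles
HALTING_FLOOR_NATS = 0.3  # alarm if halting entropy is below this by cycle 2

def entropy_alarms(latent_by_cycle, halting_by_cycle):
    """Return a list of (cycle, message) alarms; cycles are 1-indexed."""
    alarms = []
    for c in range(1, len(latent_by_cycle)):
        prev, cur = latent_by_cycle[c - 1], latent_by_cycle[c]
        if prev > 0 and (prev - cur) / prev > LATENT_DROP_FRAC:
            alarms.append((c + 1, "latent entropy dropped > 30% cycle-to-cycle"))
    # Assumption: the halting floor is only enforced on cycles 1 and 2.
    for c, h in enumerate(halting_by_cycle[:2], start=1):
        if h < HALTING_FLOOR_NATS:
            alarms.append((c, "halting entropy below 0.3 nats"))
    return alarms
```

For example, `entropy_alarms([2.0, 1.9, 1.2], [0.9, 0.25])` flags cycle 3 for the latent drop and cycle 2 for the halting floor.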

### 5.3 Determinism and seeds

- Three seeds per training condition (37, 1337, 31415).
- All randomness (rollout sampling, noise injection, data shuffling) seeded.
- Reward verification is deterministic by construction.
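
One way to seed the three randomness sources independently from a single training seed is to derive a sub-seed per stream. The sketch below uses only the standard library; the stream names and the hashing scheme are assumptions, and in the real stack each derived seed would be passed to the corresponding framework generator instead of `random.Random`.

```python
import hashlib
import random

SEEDS = (37, 1337, 31415)
STREAMS = ("rollout_sampling", "noise_injection", "data_shuffling")

def stream_rng(base_seed, stream):
    """Derive an independent, reproducible RNG for one randomness source."""
    digest = hashlib.sha256(f"{base_seed}:{stream}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

def make_rngs(base_seed):
    """One RNG per stream, all determined by the training seed."""
    return {name: stream_rng(base_seed, name) for name in STREAMS}
```

Separate streams mean that changing, say, the number of rollouts per prompt does not shift the noise-injection or shuffling sequences for the same seed.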

---

## 6. Data Pipeline

### 6.1 Sources

| Source | License | Approx. # problems | Used for |
|---|---|---|---|
| MATH (Hendrycks et al.) | MIT | 12500 | Mining + RL |
| AIME 1983–2024 | Public | 1230 | Mining + RL |
| AIME 2025 | Public | 30 | Held-out eval |
| AIME 2026 | Public (post-cutoff) | ~30 | Held-out eval (decontaminated) |
| HMMT 2018–2024 | Public | ~280 | Mining + RL |
| HMMT 2025 | Public | ~40 | Held-out eval |
| Putnam 1980–2024 | Public | 1080 | Mining + RL |
| LiveCodeBench (Pre-2026Q1) | MIT | ~1100 | RL |
| LiveCodeBench (2026 Q2+) | MIT | ~150 | Held-out eval |
| Codeforces div2 2020–2025 | Scraped, fair use | ~3000 | Mining + RL |
| ARC-AGI-2 train | Apache-2.0 | 1000 | Mining + RL |
| ARC-AGI-2 eval-private | API | 400 | Held-out eval |
| DeepSeek-Prover-V2 generated theorems | MIT | ~50000 *available*; **≤7500 used** to keep Lean fraction of RL mix ≤15% (see verifier-throughput budget §5.2) | RL (Lean-verified) |
| OpenMathInstruct-2 | NVIDIA-open | ~14M | Original SFT mix |
| Open-Orca | Apache-2.0 | ~4M | Original SFT mix |

### 6.2 Decontamination

For each (mining + RL) source we run an n-gram overlap check (n=8) against:

- The base model's pretraining corpus. Model perplexity on the problem statement is only a noisy contamination signal, so we use the more conservative n-gram check against the publicly known portions: RedPajama-v2, FineWeb-Edu, StackV2-Edu.
- Held-out evaluation sets.

We retain only problems whose 8-gram overlap with the eval set is < 1%.
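
A minimal sketch of the overlap filter follows. The text does not fix the tokenization or the exact definition of the overlap percentage, so two assumptions are made here: n-grams are taken over lowercased whitespace tokens, and overlap is the fraction of a problem's 8-grams that also occur anywhere in the eval set. All function names are illustrative.

```python
def ngrams(text, n=8):
    """Set of word-level n-grams after lowercasing and whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(problem, reference_ngrams, n=8):
    """Fraction of the problem's n-grams that also occur in the reference set."""
    grams = ngrams(problem, n)
    if not grams:
        return 0.0  # fewer than n tokens: nothing to compare
    return len(grams & reference_ngrams) / len(grams)

def decontaminate(problems, eval_texts, n=8, threshold=0.01):
    """Keep problems whose n-gram overlap with the eval set is below threshold."""
    reference = set()
    for text in eval_texts:
        reference |= ngrams(text, n)
    return [p for p in problems if overlap_fraction(p, reference, n) < threshold]
```

Problems shorter than eight tokens have no 8-grams and pass the filter by construction, so very short statements would need a separate check.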

**AIME 2026 / HMMT 2026 leakage.** AIME 2026 was held 2026-02-04. Today is 2026-05-03. Web-scraped solutions could already be in any FineWeb-Edu update or in DeepSeek-Prover-V2 generated theorems pulled after 2026-02-04. Mitigation: (a) Qwen2.5-7B base has cutoff 2024-09, well before AIME 2026, so the *base* itself is uncontaminated; (b) we filter our SFT-mix corpora to versions snapshotted ≤ 2025-12-31; (c) we re-verify that base `pass@1024 = 0` on AIME 2026 problems, an outcome that would be very unlikely if solutions had leaked into the *base*; (d) for any AIME-2026 problem where `pass@1024 > 0`, we manually spot-check the base CoTs and exclude the problem if solution recall is observed.

### 6.3 Hard-set construction

For each candidate problem `x`, draw 1024 samples from the base model (T=0.8, top-p=0.95) and verify each one. Retain `x` only if 0/1024 are correct. Every 2 cycles, refresh the hard set against the *current* base, since SFT cycles change the base.
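
The retention rule reduces to a filter over the candidate pool. In this sketch `sample_fn` and `verify_fn` are placeholder callables standing in for the vLLM sampler (assumed to be configured with T=0.8, top-p=0.95) and the task verifier; neither name comes from the codebase.

```python
def is_hard(problem, sample_fn, verify_fn, n_samples=1024):
    """True if none of n_samples base-model samples is verified correct.

    sample_fn(problem) -> candidate solution
    verify_fn(problem, candidate) -> bool
    Stops at the first correct sample, since one success already
    disqualifies the problem from the hard set.
    """
    for _ in range(n_samples):
        if verify_fn(problem, sample_fn(problem)):
            return False
    return True

def build_hard_set(problems, sample_fn, verify_fn, n_samples=1024):
    """Problems on which the base model scores 0 / n_samples."""
    return [x for x in problems if is_hard(x, sample_fn, verify_fn, n_samples)]
```

Early exit saves most of the sampling budget on easy problems; only problems that end up in the hard set pay for the full 1024 samples.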

### 6.4 Storage and serving

- Cached activations / hidden states for ablation efficiency: ~2 TB on local NVMe. R2 cold backup.
- Rollout database: Postgres on a single VM, ~5 GB per cycle.
- Versioned checkpoints: Hugging Face private repo + S3.

---

## 7. Infrastructure

### 7.1 Compute layout

| Stage | Hardware | Purpose |
|---|---|---|
| Main RL | 1× RunPod 8×H100 SXM5 (trainer) + 1× 8×H100 (rollout) | GRPO + vLLM |
| Mining | 1× 8×H100 (vLLM batch inference) | pass@1024 sweeps |
| Translator | 1× 4×A100 80GB | LoRA fine-tune |
| Eval | 1× 4×H100 | pass@k on held-out |
| Ablations (1.5B) | 1× 4×H100 | Smaller-scale sweeps |
| Code-execution sandboxes | 64-vCPU CPU box | Reward verification |

### 7.1.1 Numerical precision in the latent register

Soft-embedding feedback over up to 32 steps risks accumulated FP16 error (especially the noise-injection variance, which compounds under non-linear FFNs). Concrete plan:

- Forward pass through the *latent* block: BF16 weights, **FP32 residual stream** (cast residual to FP32 around RMSNorm and back). This adds ~7% memory and ~3% latency vs all-BF16, but matches the precision used by Megatron-LM for >2k-step training stability.
- Outside the latent register: standard BF16 mixed-precision.
- Noise tensor `n_s ~ N(0, I)` sampled in FP32 then cast to BF16 for the addition.
- We will run a precision-ablation: FP16 / BF16 (residual BF16) / BF16 (residual FP32) and report drift in `‖h_s‖` over 32 steps. If FP16 accumulates >1.5× drift over BF16 (residual FP32), we will not consider FP16 results valid.
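
The motivation for an FP32 residual can be illustrated with a scalar toy that needs no ML framework. It is not the planned ablation: it emulates bfloat16 by rounding a float32 bit pattern to its top 16 bits (round-to-nearest-even) and repeatedly adds a small update to a residual of magnitude 1. The helper names and the chosen update size are assumptions for illustration.

```python
import struct

def to_fp32(x):
    """Round a Python float to the nearest IEEE-754 float32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def to_bf16(x):
    """Round a finite float to bfloat16 (top 16 bits of float32, nearest-even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def accumulate(h0, delta, steps, cast):
    """Add `delta` to a scalar residual `steps` times, rounding with `cast`."""
    h = cast(h0)
    for _ in range(steps):
        h = cast(h + delta)
    return h

# 32 latent steps, each adding an update of 1e-3 to a residual of 1.0.
bf16_residual = accumulate(1.0, 1e-3, 32, to_bf16)
fp32_residual = accumulate(1.0, 1e-3, 32, to_fp32)
```

In this toy the BF16 accumulator never moves off 1.0, because the spacing between representable values near 1 is about 0.0078 and each update is rounded away, while the FP32 accumulator reaches about 1.032. Real hidden states are vectors with mixed magnitudes, so the effect in the model will be less extreme, which is what the precision ablation is meant to quantify.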

### 7.2 Software stack

- Training framework: TRL 0.10+ with custom GRPO subclass (`HybridLatentGRPOTrainer`).
- Inference: vLLM 0.6+ with our custom decoder fork.
- Data: Hugging Face `datasets`; Polars for analytics.
- Reward: SymPy 1.13+, Lean 4 + LeanDojo, Docker/gVisor sandbox.
- Experiment tracking: Weights & Biases.
- Checkpoint storage: HF Hub private + S3.
- CI: GitHub Actions for unit tests + pre-merge integration smoke tests.

### 7.3 Open-source release

At paper submission:

- Code: Apache-2.0, single repo `reflex-rlvr/` with `training/`, `inference/`, `eval/`, `mining/`, `translator/`.
- Checkpoints: REFLEX-RLVR-Qwen2.5-{1.5B, 7B, 14B} and matched baselines.
- Hard-set: `H_K` with provenance metadata.
- Eval harness: reproducible pass@k pipeline with seeds.
- Documentation: training-from-scratch quickstart, inference quickstart.

---

## 8. Engineering Risks and Mitigations

| Risk | Mitigation |
|---|---|
| vLLM does not natively support soft-embedding feedback | We maintain a small fork; if upstream rejects, ship as patch. ~500 LoC. |
| Reward sandbox throughput bottlenecks RL | Pre-stage reward checks; parallel sandboxes; cache test-case execution per problem. |
| Catastrophic forgetting during reverse SFT | KL anchor + small epoch + 1:3 mix ratio; monitor on a *reasoning-relevant* suite each cycle: MATH-500, AIME-2024, BBH (algorithmic), MMLU-Pro (knowledge), GPQA-Diamond, HumanEval+. GSM8K and MMLU are excluded as saturated/insufficiently sensitive. Forgetting threshold: any single bench > 3pp regression triggers a rollback or λ_anchor increase. |
| Latent block KV-cache memory blow-up | Cap S_max=32 and accept that soft embeddings re-attend at full cost; alternative: gradient checkpointing within the latent block. |
| Translator equivalence collapse | Multi-objective; if `pass@4(base \| x ⊕ y) ≥ 0.5` acceptance rate falls below 0.2 across a cycle, stop the cycle, increase translator capacity (LoRA rank 64 → 128) and/or extend translator training to 4 epochs. |
| Cross-base difference (Qwen vs Llama) | Run both; report any base-specific effects honestly. |
| Numerical instability in GRPO + novelty term | Clip novelty bonus at 1.0; gradient clipping max-norm 1.0. |
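
The forgetting threshold in the table can be expressed as a per-cycle check. This is a sketch: scores are assumed to be accuracies in percent, keyed by benchmark name, and the function name is illustrative. Whether a trigger leads to a rollback or to a λ_anchor increase is left to the operator, as in the table.

```python
FORGETTING_SUITE = (
    "MATH-500", "AIME-2024", "BBH", "MMLU-Pro", "GPQA-Diamond", "HumanEval+",
)
THRESHOLD_PP = 3.0  # any single benchmark regressing by more than 3pp triggers action

def regressions(baseline, current, threshold_pp=THRESHOLD_PP):
    """Benchmarks whose accuracy (in percent) fell by more than threshold_pp."""
    drops = {name: baseline[name] - current[name] for name in FORGETTING_SUITE}
    return {name: d for name, d in drops.items() if d > threshold_pp}
```

An empty result means the cycle passes the forgetting gate; any non-empty result names the benchmarks responsible and the size of each drop in percentage points.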

---

## 8.1 Compute envelope and contingency staircase (relocated from proposal.md in iter-11; renumbered in iter-14)

Compute prices anchored at RunPod / Lambda spot rates as of 2026-Q1: H100 SXM5 ≈ $2.39/hr, A100 80GB ≈ $1.19/hr.

| Item | Hardware | Hours | Cost |
|---|---|---|---|
| Hard-set mining: initial pass@1024 on 50k problems + 2 delta re-mining passes | 8×H100 | 140 | $2,680 |
| Main RL run (Qwen2.5-7B, ≤25k steps with early stop) | 8×H100 | 220 | $4,200 |
| Translator training (5 cycles, Qwen2.5-7B + LoRA) | 4×H100 | 50 | $480 |
| Translator validation (8× base generations × ~50k accepted candidates × 5 cycles ≈ 2M base generations; iter-14 tightened from pass@4 to pass@8) | 8×H100 | 36 | $690 |
| Teacher-translator ablation (frontier API, 200 problems) | API | — | $300 |
| SAE training for novelty metric (one-time) | 4×A100 | 40 | $190 |
| SFT passes on base (≤5 × short) | 8×H100 | 25 | $480 |
| ~~Verifier-proxy training~~ — **removed iter-17:** verifier proxy was the gradient target for the criticality head, which was pre-cut in iter-16. No remaining use case. | — | — | $0 |
| Curriculum LSR computation (per-cycle) | 8×H100 | 30 | $575 |
| Ablations: 9 conditions × 2 seeds at 1.5B + pass@k≤64 eval | 4×H100 | 180 | $1,720 |
| Final headline evaluation (pass@1024 on 700-problem pool) | 4×H100 | 40 | $190 |
| **Pass@1,048,576 on the named smoking-gun problem (proposal §5.5.2.5)** | 8×H100 | 28 | $80 |
| **Diagnostic entropy logging (latent first-step + halting; proposal §§2.7.3, 2.7.4)** | 8×H100 | 18 | $50 |
| **Cycle-1 forgetting-suite eval (PSR-diversity-collapse early gate; proposal §2.7.3 item 4)** | 8×H100 | 140 | $400 |
| **FIPO baseline replication on Qwen2.5-7B (1 seed; FIPO arXiv March 2026)** | 8×H100 | 140 | $400 |
| **AIME-2026 hard-set mining (30 problems × pass@4096)** | 8×H100 | 12 | $30 |
| API costs (synthetic Lean theorems, problem statements) | — | — | $400 |
| Buffer (~5%) | — | — | $500 |
| **Total nominal** (iter-D3 audit fix: actual line-item sum) | | | **$13,365** |

**Iter-D3 reconciliation per audit-round-2 issue C3.** v0.D2 claimed $12,405 nominal but the line items actually sum to $13,365 (audit caught the $960 arithmetic error). The narrative "$5–8K envelope" applies only *after* the contingency staircase fires. Reconciled budget below:
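
The reconciled total can be re-derived directly from the Cost column. The dictionary below abbreviates the row labels; the dollar figures are copied from the table.

```python
# Cost column of the nominal budget table, in USD (row labels abbreviated).
LINE_ITEMS = {
    "hard-set mining": 2680,
    "main RL run": 4200,
    "translator training": 480,
    "translator validation": 690,
    "teacher-translator ablation": 300,
    "SAE training": 190,
    "SFT passes": 480,
    "verifier-proxy training (removed)": 0,
    "curriculum LSR": 575,
    "ablations": 1720,
    "final headline eval": 190,
    "pass@1,048,576 smoking gun": 80,
    "diagnostic entropy logging": 50,
    "cycle-1 forgetting suite": 400,
    "FIPO baseline": 400,
    "AIME-2026 hard-set mining": 30,
    "API costs": 400,
    "buffer": 500,
}

nominal = sum(LINE_ITEMS.values())
discrepancy = nominal - 12_405  # difference from the total claimed in v0.D2
```

Keeping the line items in one structure and computing the total, instead of typing it by hand, is a simple guard against the kind of arithmetic slip the audit caught.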

**Contingency staircase** — applied in priority order, pre-committed before launch, reported in the paper:

| Step | Trigger | Action | Savings |
|---|---|---|---|
| C1 | always | Drop 14B confirmation; report 1.5B + 7B scaling only | $1,500 |
| C2 | always | Reduce ablations to 9 cond × 2 seeds (already reflected in nominal) | (counted) |
| C3 | always | Reuse SAE trained for novelty as primitive-coverage SAE | $0 |
| C4 | conditional (forecast P~40%) | Early-stop fires at cycle ≤3 → 15k RL steps | $1,500 |
| ~~C5~~ | **PROMOTED to PRIMARY** (per external-critique iter-C1) | Llama-3.1-8B cross-base is now unconditional. A "Qwen-only" result risks rejection as a Qwen-pretraining-data artifact (Reviewer rF4g, OpenReview 2026). Net cost reallocation: drop ablation P (layer-sweep) and ablation L (Fréchet vs SVD math, already justified theoretically), saving $390, which is applied to Llama-cross. | $0 (reallocated) |
| C6 | conditional | LSR curriculum every-other-cycle | $400 |
| C7 | additive (only if buffer ≥ $1,200 after primary contingencies) | Run Qwen2.5-7B-Instruct secondary (proposal §1.6.2) | (cost: +$1,200, not savings) |

**Expected total under contingency (iter-D3 reconciled with actual nominal $13,365):**
- Nominal $13,365.
- C1 fires (always): drop 14B confirmation. − $1,500. Subtotal $11,865.
- C4 fires (P~40%, conditional): early-stop ≤ cycle 3. − $1,500. Subtotal $10,365.
- C6 fires (conditional): LSR every-other-cycle. − $400. Subtotal $9,965.
- **Realistic-middle case:** ~$10,000 (was previously claimed $7,200 — that number was wrong).
- **Aggressive-case (all reactive contingencies fire + drop ablations to 1 seed):** ~$8,000.
- **Floor:** ~$6,500 (drop Llama-cross AND ablations limited to LSR-only baseline).
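
The subtotal chain in the walkthrough is a running subtraction, sketched below. The step order follows the bullets (C1, then C4, then C6); the function name is illustrative. The rounded case figures (~$10,000, ~$8,000, ~$6,500) involve further judgment calls and are not derived here.

```python
NOMINAL = 13_365

# (step, savings in USD) in the order the walkthrough applies them.
STEPS = [("C1", 1_500), ("C4", 1_500), ("C6", 400)]

def running_subtotals(nominal, steps):
    """Return [(step, subtotal)] after subtracting each step's savings in order."""
    out, total = [], nominal
    for name, savings in steps:
        total -= savings
        out.append((name, total))
    return out

subtotals = running_subtotals(NOMINAL, STEPS)
```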

**Honest budget statement (iter-D3 narrative reconciliation):** the proposal's "$5–8K envelope" wording is updated to **"$8–10K envelope at realistic-middle; $13K nominal absent contingency."** This is the honest range. The $5–8K wording in older drafts (proposal §3) is anachronistic and is corrected in v0.D3.

## 9. Engineering Milestones

| Month | Deliverable | Compute spend (cumulative) |
|---|---|---|
| 1 | vLLM fork passes unit tests; halting head + cosine-anneal noise schedule stable on toy tasks (criticality head pre-cut per proposal §7.5.0; not implemented in v1); **Week-1 premise pilot result** (go/no-go gate) | $250 |
| 2 | Hard-set mining complete on Qwen2.5-1.5B; ~~verifier-proxy AUROC ≥ 0.85~~ (verifier proxy removed iter-17, see §8.1); 100-step RL pilot positive | $800 |
| 3 | One full cycle at 1.5B; pass@k crossover plot reproduces Yue et al. on baselines | $1700 |
| 4 | Five cycles at 1.5B (or earlier convergence); ablations 1–6 complete; ~~verifier-proxy retraining stable across cycles~~ (verifier proxy removed iter-17, see §8.1) | $3200 |
| 5 | Main 7B run launched; cycle-1 forgetting-suite check passed; mid-run sanity check | $5000 |
| 6 | Main 7B run complete (or early-stopped); all ablations at 1.5B done; mechanistic study underway | $7000 |
| 7 | (Per contingency staircase) Llama-3.1-8B cross-base run OR deferred to v1.1 | $7600 |
| 8 | Paper draft, open-source release prep | $8000 |

---

## 10. What is *Not* in Scope (To Stay Within Budget)

- Pretraining a model from scratch.
- Multimodal extension (Year 2+).
- Theoretical proofs of capacity expansion (we provide empirical evidence and a mechanistic story; theoretical companion paper is Year 3).
- Distributed RLHF beyond GRPO + TBA pattern (e.g., PPO with separate critic).
- Frontier-scale (>14B) confirmation.

These are deliberate scope cuts to ensure the core scientific question — *can we expand capacity teacher-free?* — gets a clean, reproducible answer within the budget envelope.
