# REFLEX-RLVR

**Breaking the Base-Model Reasoning Ceiling via Latent-Space Exploration and Latent-to-Discrete Policy Transfer (LDPT)**

*(LDPT is the published name for the contribution earlier called "reverse self-distillation through discriminative-to-generative transfer" in internal drafts. The mechanism is the same; the rename addresses an external-reviewer concern that "reverse self-distillation" sounded like vanilla CoT distillation. Theoretical framing per Jolicoeur-Martineau 2025: LDPT is a policy-improvement operator that converts latent-discriminative competence into discrete-generative capability.)*

> **iter-D7, 2026-05-04 — PAPER v0.5 STATUS NOTE.** Adds Llama-3.1-8B base cross-family check (behavioral + logprob, 2 new measurements; total **11 independent rejections** across **2 model families**). Both Llama runs FAIL the gate; Cliff's $\delta \leq 0.022$. Independence framing for the discrimination/exploration ceilings strengthened in §1.4 + §5.3 (3 operationally distinct senses of independence, made explicit). Cumulative spend: \$14.94 / \$15 ceiling.
>
> **iter-D6, 2026-05-04 — PAPER v0.4 STATUS NOTE.** Adds Qwen2.5-{1.5B,7B}-Instruct contrast (post-hoc, 4 new measurements; total 9 independent rejections of the asymmetry premise). Instruct cleanly separates emission strength (RECOVERED: $p_{\rm disc}({\rm oracle})$: 0.000$\to$0.659 at 7B-base$\to$Instruct) from discrimination strength (NOT recovered: $\Delta$ stays $\leq 0.012$ across all base/Instruct/scale combinations). Cliff's $\delta + $ Cohen's $d + $ power analysis added (`figures/output/effect_sizes_and_power.csv`); Figure 4 per-problem $\Delta$ histograms added. Paper now 301 lines with new §5 Analysis section. Cumulative Modal spend: \$14.18 / \$15 ceiling. NeurIPS 2027 main-track target.
>
> **iter-D5, 2026-05-04 — PAPER v0.3 STATUS NOTE.** The Week-1 premise pilot specified in §1.7 of this document has been run, the negative-result paper drafted (`paper/paper.md` v0.3, ~324 lines, NeurIPS-style), 3 figures rendered with 95% bootstrap CIs (`figures/output/`), and a post-hoc logprob test added (per paper §4.5). NeurIPS 2027 main-track target. The §1.7 pivot below remains the load-bearing reference for the pivot decision.
>
> **iter-D4, 2026-05-03 — POST-PILOT STATUS NOTE.** The Week-1 premise pilot specified in §1.7 of this document has been run. **Both AIME H_K (51 problems, mean p_disc(oracle) = 0.012, delta = 0.000) and the synthetic memorization control (50 problems, mean p_disc(oracle) = 0.160, delta = −0.005) FAIL all three pre-registered sub-criteria of gate (a).** Both pre-registered failure modes from §1.7 trigger (mean p_disc(oracle) << 0.5 AND delta << 0.2). A post-hoc Qwen2.5-7B base scale check (NOT pre-registered) shows the same failure pattern more strongly (mean p_disc(oracle) = 0.000, delta = −0.0147). Memorization control consistent. Per the LOCKED §1.7 pivot rule, the project re-frames as a negative-result paper sharpening Yue et al.; the 7B headline run is NOT proceeding under the original framing. The headline submission is now `paper/paper.md` v0.2 (the negative-result paper). The proposal text below is preserved as the as-designed pre-registration record. Headline numbers and decision JSON are at `results/pilot/gate_a_decision.{json,md}`. Total pilot spend: ≈ $13.

> **Document scope.** This is the *project proposal* for a research program targeting a NeurIPS Main Conference submission. The headline paper is a single contribution with the experiments specified in §5; the broader 3-year program (§7) is academic-program context, not paper content. The paper itself is intended to fit the standard NeurIPS 9-page format.

---

## The headline smoking-gun: a specific AIME 2026 problem

Per external critique ("AIME 2026 was Jan 2026... post-pretraining competition problem... guaranteed Oral"): we pre-commit to an **AIME-2026-specific** smoking-gun protocol.

**Pre-registered headline:** REFLEX-RLVR solves at least one specific AIME 2026 problem (held-out, post-Qwen2.5 pretraining cutoff Sep 2024) at `pass@1 ≥ 0.5`, where Qwen2.5-7B-base scores `pass@1,048,576 = 0` (verified by the §5.5.2.5 1M-sample escalation). FIPO and DeepSeek-R1-Distill-Qwen-7B both also fail this problem at `pass@1024`.

**Scope:** AIME 2026 problems are 30 total (AIME I and AIME II, Feb 2026, 15 problems each). We pre-mine all 30 against Qwen2.5-7B-base at `pass@4096`; the subset with `pass@4096 = 0` (predicted ≥ 25 of 30 based on prior-year statistics) is our AIME-2026-hard set.

**Pre-registered selection rule for the named smoking-gun problem** (locked before any REFLEX-RLVR result is observed; iter-D3 fix per audit issue C4 — step 2 of v0.D2 selected on the dependent variable [highest REFLEX pass@4096], which is post-hoc p-hacking. Inverted below to use only AoPS engagement [observed in Feb 2026, pre-locked]):

1. From the AIME-2026-hard subset, restrict to problems whose AoPS thread (Art of Problem Solving forum) had ≥ 100 distinct posters in the 2 weeks following release (Feb 4–18, 2026). This is the *external operationalization* of "famous / most discussed," and is fully observable before any REFLEX training.
2. **(Iter-D3) Pre-commit to the single problem with the HIGHEST AoPS engagement count among the AIME-2026-hard subset** — i.e., the most-discussed problem that the base also fails on. This selection is on the *independent* variable (AoPS engagement) only. Snapshot the AoPS post counts on 2026-04-30 (≥4 days before paper deadline pre-registration) and freeze. The selected problem is locked in `smoking_gun_problem.txt` before any 7B training begins.
3. Verify base `pass@1,048,576 = 0` on that single problem (proposal §5.5.2.5 protocol).
4. **Pre-registered fallback ladder:** if step (3) shows base solves at any of 1M attempts on the AoPS-rank-1 problem, we drop down to AoPS-rank-2, then rank-3, and so on, *all pre-committed in the same `smoking_gun_problem.txt` ranking*. We do *not* re-rank by REFLEX performance.
5. If REFLEX-RLVR fails to achieve `pass@1 ≥ 0.5` on the AoPS-rank-1 problem (after step 3 verifies base = 0), we honestly report the failure on the rank-1 problem and additionally report the broader 700-pool gain. We do *not* "shop down" the AoPS list looking for a problem REFLEX happens to solve — that would be exactly the post-hoc selection error we removed.
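
The selection rule above can be sketched in a few lines. This is a minimal illustration, not the project's tooling: the `Problem` record, its field names, and the `base_solves_at_1m` callback are hypothetical, and a single `aops_engagement` number stands in for both the step-1 threshold and the step-2 ranking.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Problem:
    problem_id: str         # hypothetical identifier
    aops_engagement: int    # assumed: one frozen AoPS engagement count per problem
    base_pass_4096: float   # base pass@4096 from pre-mining


def freeze_ranking(problems: List[Problem], min_engagement: int = 100) -> List[Problem]:
    """Steps 1-2: keep base-unsolved problems with enough AoPS engagement,
    ranked by engagement only (never by REFLEX performance)."""
    hard = [
        p for p in problems
        if p.base_pass_4096 == 0 and p.aops_engagement >= min_engagement
    ]
    return sorted(hard, key=lambda p: p.aops_engagement, reverse=True)


def select_smoking_gun(
    ranking: List[Problem],
    base_solves_at_1m: Callable[[Problem], bool],
) -> Optional[Problem]:
    """Steps 3-4: walk down the frozen ranking until the base fails all 1M samples."""
    for problem in ranking:
        if not base_solves_at_1m(problem):
            return problem
    return None  # every ranked problem was solved by the base at 1M samples
```

The ranking is computed once and then only read, which mirrors the commitment that the ladder is fixed before any REFLEX result exists.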

**Why the inversion matters.** A reviewer's attack on v0.D2 step 2 wrote itself: "you ran on 30 AIME problems, picked the one your method scored best on, and called it the smoking gun." Step 2 of v0.D3 picks on AoPS engagement (an independent, externally measurable variable observed in Feb 2026, before our pretraining cutoff for any new training data); the headline becomes "we solve the most-famous AIME-2026 problem at pass@1 ≥ 0.5," which is a sharp, falsifiable, pre-registered claim — not "we solve the AIME-2026 problem we happened to do best on." If the AoPS-rank-1 problem turns out to be the hardest one for REFLEX, that is the honest result.

**Falsifiability strengthened by the inversion.** In v0.D2, the smoking gun was guaranteed to "succeed" because we picked our best problem. In v0.D3, the smoking gun fails if we miss on the AoPS-rank-1 problem — which is a real risk and a real falsifier. The §5.5.1 marginal-tier outcome ("no AIME-2026-named win, broader 700-pool gain only") becomes a substantive scientific finding rather than a hidden disappointment.

**Statistical honesty:** the AIME-2026-named-problem result is *narrative* — n=1 on a specific problem. The *statistical* claim is the broader 700-problem `Δ pass@1024` gain (§5.0). We frame the named problem as the conference-talk hook, the 700-pool gain as the headline statistical result.

This is the "guaranteed Oral" path the external critique identified. If REFLEX-RLVR cannot solve any AIME-2026-hard problem at pass@1 ≥ 0.5, the headline downgrades to the broader 700-problem pool gain (still publishable but not Oral-worthy).

## The general smoking-gun protocol (broader 700-problem pool)

A NeurIPS reviewer remembers one concrete example longer than ten Pareto curves. Our headline gain (`Δ pass@1024 ≈ 0.10` across 700 problems) is summary statistics; we will additionally identify and tell the story of **one specific named problem** that REFLEX-RLVR solves and that the base + every baseline does not. Concrete protocol:

1. From `H_K_eval` (700 problems, all `pass@4096(base) = 0`), identify the subset our method solves at `pass@1024(REFLEX-RLVR) ≥ 0.5`.
2. Pick the single problem with the highest *fame* (preferring named competition problems: a specific AIME 2026 problem, a specific HMMT 2026 problem, a known Putnam problem, a specific ARC-AGI-2 task with a published difficulty score).
3. Trace the latent register's solution end-to-end: the cosine-anneal noise pattern, the SAE features that fire, the translated discrete CoT, the human-readable explanation.
4. This case study becomes Figure 2 of the paper and the centerpiece of the conference talk.

**Why this matters:** an oral talk is 12 minutes; a memorable single example dominates the audience's takeaway. Yue et al. (NeurIPS 2025 Best Paper Runner-Up) is remembered for the pass@k crossover *plot*, not the aggregate numbers. REFLEX-RLVR should be remembered for the named problem we cracked.

**Risk:** if no single famous-named problem clears the bar, we fall back to the most striking *category* of problem (e.g., "REFLEX-RLVR solves 8 of 30 AIME 2026 problems that the base solves 0 of 4096 on; we walk through one"). Either is publishable; the named-problem version is stronger.

## The single most distinctive thing

If a reviewer remembers one thing about this paper, it should be: **a base LLM is a stronger discriminator of reasoning chains than a generator, and REFLEX-RLVR is the recipe that converts that discriminator into a generator on previously unsolvable problems, using nothing but a verifier and the model itself.** The latent register is the exploration mechanism; reverse self-distillation through the discriminator is the transfer mechanism; the cycle is the bootstrap. No external neural teacher.

## Submission target

NeurIPS 2026 Main Conference (submission deadline mid-May 2026; we are on track for the *2026 cycle* with the Week-1 pilot in early May). If the headline `Δ pass@1024` falls in the *modest positive* tier (per §5.5.1), we additionally consider ICLR 2027 as a re-submission target. The work is camera-ready in 8 months from kickoff per `architecture.md §9` milestones.

## Camera-ready paper synopsis (what the NeurIPS submission will look like)

- **Title.** REFLEX-RLVR: Self-Teacher Capacity Expansion via Latent-Register Exploration and Latent-to-Discrete Policy Transfer (LDPT).
- **One-sentence contribution.** A self-teacher post-training recipe that improves pass@k_max on problems where the base model previously solved zero out of 1024 samples, by coupling latent-register RL exploration to reverse self-distillation through the base model's discriminative competence.
- **Headline figure.** pass@k curves on a 700-problem hard-set held-out from base-pass@1024=0 problems (post-cutoff competition math + code + ARC-AGI-2). REFLEX-RLVR vs (base, GRPO, DAPO, Coconut+GRPO, high-T base, distillation oracles, DeepSeek-R1-Distill-Qwen-7B). The single most important number is `Δ pass@1024(REFLEX-RLVR vs base)`.
- **Headline ablation.** Leave-one-out across the six trained/active REFLEX components (latent register, LDPT, hard-set restriction, NSR, halting head, SAE-novelty), reported as a single bar chart of `Δ pass@K`. *Criticality head is pre-cut (§7.5.0); it is not in the headline ablation, only in a single appendix experiment for completeness.* Components that don't contribute will be dropped from the camera-ready method description.
- **Mechanistic story.** SAE-feature trace showing that REFLEX-RLVR rollouts activate a class of features absent from the base's failed CoTs on the same problems; primitive-coverage stratification showing gains concentrate where the structural argument predicts.
- **Reproducibility.** Apache-2.0 code, all checkpoints, hard-set with provenance, vLLM fork, Lean4 verification harness, single-command Docker reproducer for the headline `Δ pass@1024` numbers.

The remainder of this document describes the method, the supporting infrastructure, and the broader research program.

---

## Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is the dominant post-training paradigm for reasoning LLMs (DeepSeek-R1, OpenAI o-series, Qwen3-Thinking). Yue et al. (NeurIPS 2025 Best Paper Runner-Up) showed that RLVR sharpens but does *not* expand the reasoning capacity of its base model: across model families, algorithms, and benchmarks, every problem solvable by an RLVR-trained model at pass@1 is already solvable by the base model at pass@k for some k. The only published method that expands the reasoning frontier is distillation from a stronger teacher — which presupposes such a teacher and inherits its blind spots. We present **REFLEX-RLVR**, a *self-teacher* post-training method (the latent register acts as the teacher of the discrete model; no external neural teacher is used; the only correctness signal is a formal verifier — Lean kernel, SymPy, sandboxed code execution) designed to push pass@k_max upward on problems where the base model's pass@k_max is zero.

**The single new idea.** REFLEX-RLVR's contribution is one mechanism: **discriminator-to-generator transfer through reverse self-distillation in a verifier-filtered latent-register cycle.** That sentence is the paper. Latent registers (Coconut), RLVR, and SFT are all known; the new thing is the *cycle* that uses the base's latent-space discrimination to generate SFT signal that expands the same base's discrete generation support, with no external neural teacher. Everything else (NSR, SAE-feature novelty, halting head) is supporting machinery whose individual contributions are ablated and *dropped from the camera-ready method description if they don't pay off*. The criticality head is **pre-cut** before launch (see §7.5.0) to enforce elegance.

REFLEX-RLVR alternates three phases:

1. **Latent exploration.** A `<think>...</think>` register lets the model reason in continuous latent space (Coconut-style soft-embedding feedback). During RL rollouts we inject scheduled Gaussian noise into the residual stream *only inside* the latent register, exploring *token-mixture compositions* that no temperature setting on the discrete softmax can reach (the structural argument in §2.10). Exploration is shaped by NSR (Negative Suppression Reinforcement: high-confidence-incorrect trajectories receive penalty) which prevents collapse to confident-but-wrong modes; novelty is measured in **SAE feature space**, not raw activation space, ensuring an interpretable and curse-of-dimensionality-resistant bonus.
2. **Verifiable hard-problem mining.** RL is restricted to a curriculum where `base_pass@1024 == 0`, so any reward signal is by construction evidence of capacity expansion (not sampling-efficiency improvement). The verifier (SymPy + Lean kernel + sandboxed code execution) provides reward but no continuous gradient signal — there is **no neural teacher**.
3. **Latent-to-Discrete Policy Transfer (LDPT).** A small "translator" maps successful latent trajectories back to discrete CoT. The translator's training target exploits the empirical asymmetry that base LLMs are stronger discriminators of valid reasoning chains than generators of them: we accept a translated `y` if `pass@4(base | x ⊕ y) ≥ 0.5`. Accepted `(x, y)` pairs **supervised-fine-tune** the base (this is *SFT*, not RL — see §2.7.3 for why SFT does not collapse the latent diversity from Phase 1 into mode-seeking), *creating* the ability to generate this CoT family. The next cycle starts from a strictly larger discrete support. The loop bootstraps without any external neural teacher.
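
The LDPT acceptance test in phase 3 can be illustrated as follows. This is one possible reading, assuming the plug-in estimate `1 - (1 - p̂)^k` from §2.1 is applied to a batch of verifier outcomes for the base conditioned on `x ⊕ y`; the function names and the batch-of-outcomes interface are hypothetical.

```python
from typing import Sequence


def pass_at_k(p_hat: float, k: int) -> float:
    """Plug-in pass@k from an empirical per-sample solve rate p_hat."""
    return 1.0 - (1.0 - p_hat) ** k


def accept_translation(outcomes: Sequence[bool], k: int = 4, threshold: float = 0.5) -> bool:
    """Accept a translated chain y when the base, conditioned on x and y,
    reaches pass@k >= threshold according to the verifier outcomes."""
    if not outcomes:
        raise ValueError("need at least one verified sample")
    p_hat = sum(outcomes) / len(outcomes)
    return pass_at_k(p_hat, k) >= threshold
```

Under this reading, `pass@4 ≥ 0.5` corresponds to a per-sample solve rate of at least `1 - 0.5^(1/4) ≈ 0.159`, so the criterion is a fairly permissive behavioral filter rather than a demand for reliable single-shot correctness.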

**LDPT vs. CoT distillation:** LDPT is *not* CoT distillation in the standard (Hinton-style) sense. Standard CoT distillation transfers from a *stronger external teacher* to a student. LDPT transfers from the *same base model's latent computational space to its own discrete generative space* — discriminator-to-generator transfer within one model, with the verifier as the only correctness signal.

**LDPT as a policy-improvement operator (theoretical claim, drawing on Jolicoeur-Martineau 2025).** Let $π_{\text{disc}}(y|x)$ be the base's discriminative competence (probability assigned to the correct answer conditioned on chain $y$ as CoT context). Let $π_{\text{gen}}(y|x)$ be the base's generative competence (probability of emitting chain $y$). The empirical generator–verifier asymmetry says $π_{\text{disc}}(y|x) > π_{\text{gen}}(y|x)$ for many high-quality $y$. LDPT is the operator $T$ such that $T[π_{\text{gen}}](y|x)$ moves toward $π_{\text{disc}}(y|x)$ on the support of accepted translations. Under the standard policy-improvement guarantee (Sutton & Barto 2018, generalized to LM policies in Jolicoeur-Martineau 2025), if the discriminator assigns probability ≥ 0.5 to the correct answer on the accepted set, then $T[π_{\text{gen}}]$ strictly improves over $π_{\text{gen}}$ on the same problem class — *without* requiring an external teacher. Each LDPT cycle composes one policy-improvement step; the cycle structure is a sequence of compositions $T_5 ∘ T_4 ∘ ... ∘ T_1$, each strictly improving the generative policy on the hard set.

This is the theoretical claim. It is sketched, not formally proved; a full proof would require theory machinery beyond the scope of this proposal.

**Calibration caveat (acknowledged limit of the framing).** The policy-improvement guarantee assumes the discriminator $π_{\text{disc}}$ is *calibrated* — i.e., a 0.6 probability assigned to the correct answer corresponds to 0.6 actual correctness rate. LM discriminators are well-known to be miscalibrated (Anthropic's *Calibration of LLMs* 2024; Tian et al. ICML 2024). If miscalibration is severe and the discriminator over-claims correctness on its accepted translations, LDPT could move the generator toward over-confident-but-wrong outputs. **Mitigation:** the verifier is the actual oracle of correctness, not the discriminator's probability. The acceptance criterion `pass@4(base | x ⊕ y) ≥ 0.5` is a *behavioral* test (the base actually generates correct answers at the threshold) not a *probability claim* about the discriminator's calibration. So the LDPT acceptance step is robust to discriminator miscalibration. This is the *operational* interpretation of the policy-improvement claim that survives the calibration caveat.

The claim is *testable*: each cycle's `Δ pass@K` should be non-negative, and the cumulative gain should match the bound implied by the policy-improvement chain. We pre-register this as the cycle-monotonicity test in §2.8 (cycle convergence criterion).
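
A minimal sketch of the monotonicity check follows. It is illustrative only: the function names and the `tol` slack parameter are assumptions, and the §2.8 criterion may add statistical tolerance that this version does not model.

```python
from typing import List, Sequence


def cycle_deltas(pass_at_k_by_cycle: Sequence[float]) -> List[float]:
    """Per-cycle change in pass@K, starting from the base (index 0)."""
    return [b - a for a, b in zip(pass_at_k_by_cycle, pass_at_k_by_cycle[1:])]


def is_cycle_monotone(pass_at_k_by_cycle: Sequence[float], tol: float = 0.0) -> bool:
    """True when no cycle lowers pass@K by more than tol (tol is a hypothetical slack)."""
    return all(delta >= -tol for delta in cycle_deltas(pass_at_k_by_cycle))
```

For example, a trajectory of `[0.0, 0.25, 0.5, 0.5]` passes, while any cycle that reduces pass@K below the previous cycle's value fails the check at `tol=0`.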

The core scientific bet is that the *latent register breaks ergodicity*: the base model's discrete CoT distribution is supported on a manifold whose complement contains the unsolved problems, and continuous noise in the latent register can escape that manifold under verifiable-reward filtering. We provide both negative and positive evaluation criteria. A negative result (the loop fails to bootstrap on hard problems) sharpens Yue et al. into a structural claim about post-training's intrinsic limits; a positive result is, to our knowledge as of 2026-05, an open recipe for RL-based capacity expansion that does not require an external neural teacher.

---

## 1. The Gap and Literature Context

### 1.1 The exact open problem

Yue et al., *"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?"* (NeurIPS 2025 Best Paper Runner-Up): "Current RLVR methods have not fully realized the potential of RL to elicit genuinely novel reasoning abilities in LLMs." Their evidence is the pass@k crossover: at small k RLVR wins, at large k base ≥ RLVR. The implication is that RLVR's reward gradient cannot move probability mass *outside* the base model's reasoning support.

### 1.2 Why existing exploration RL does not close the gap

The closest contemporaneous work is the wave of NeurIPS 2025 exploration-RL papers. We differentiate carefully:

- **Bartoldson et al., *"Trajectory Balance with Asynchrony"* (NeurIPS 2025):** off-policy TB objective for sample efficiency. Improves *learning speed* over on-policy GRPO, but rollouts are still discrete tokens sampled from π_θ, so exploration is bounded by the softmax support of π_θ, which is initialized from the base.
- **Song et al., *"Outcome-based Exploration for LLM Reasoning"* (NeurIPS 2025):** UCB on outcomes + batch-level repetition penalty. Encourages *output diversity*, not *trajectory novelty in latent space*. Their own analysis confirms outcome-based RL *collapses* answer diversity on unsolved questions (fewer distinct answers than base); their exploration bonus partially counteracts this but does not produce positive `Δ pass@k_max`. This is precisely the support-bound limitation Yue et al. identify.
- **Zhang et al., *"Consistent Paths Lead to Truth: Self-Rewarding RL for LLM Reasoning"* (NeurIPS 2025):** CoVo intrinsic reward (consistency + volatility) for label-free RL. No support expansion claim.
- **Wang et al., *"Reinforcement Learning for Reasoning in LLMs with One Training Example"* (NeurIPS 2025):** extreme data efficiency, still discrete CoT.

**Why the latent register escapes this:** In all of the above, the action at every step is a discrete token, so the realizable trajectory distribution is a product of softmax distributions supported on sequences over the base's vocabulary. With Gaussian noise injected into the residual stream of a *soft-embedding-feedback* register, the realizable distribution is instead over trajectories in $\mathbb{R}^d$, within which the trajectories reachable through the base's discrete token embeddings form a set of measure zero. This is the support-expansion mechanism. The verifier filters this large continuous set to the verifiable subset; reverse distillation then transfers the discoveries back into the discrete model.

### 1.3 Latent reasoning is the natural escape — but no one has coupled it to RLVR for capacity expansion

- Hao et al., *"Coconut: Training LLMs to Reason in a Continuous Latent Space"* (Meta, NeurIPS 2024): replaces discrete CoT with last-hidden-state feedback in a `<bot>...<eot>` block; trained via SFT on existing teacher CoTs. Demonstrates BFS-like search invisible to discrete CoT — *the existence proof that latent-region solutions exist*. *Limitation*: SFT-only; never re-projected into the discrete model; no RL signal. **REFLEX-RLVR's orthogonal contribution: those latent-region solutions are RL-trainable from a verifier signal and re-projectable into the discrete base.** In one line: Coconut shows the room exists; REFLEX-RLVR shows you can walk into it without a guide and bring back what you found.
- Geiping et al., *"Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach"* (NeurIPS 2025): test-time depth via block iteration. *Limitation*: shared block, no explicit latent dynamics, no RL.
- Nie et al., *"LLaDA: Large Language Diffusion with Masking"* (NeurIPS 2025 Oral): masked-diffusion *over tokens*. *Limitation*: still token-supported.
- Shen et al., *"CODI: Continuous Chain-of-Thought via Self-Distillation"* (referenced in MechInterp workshop, NeurIPS 2025): continuous CoT via self-distillation but uses a teacher CoT as the distillation target. *Limitation*: the supervision signal is *the same teacher CoTs that bound the base support*.

To our knowledge as of 2026-05, REFLEX-RLVR is the first published method to combine all three of: (a) latent register *as an exploration tool* (Coconut used it for compute, not exploration), (b) RLVR-driven training of that register on hard problems where pass@k_max = 0, and (c) reverse-distillation of the discoveries into the discrete generative model.

### 1.4 The reverse-distillation angle is informed by 2025 distillation theory

- Cha and Cho, *"Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation"* (NeurIPS 2025): distillation induces a precision/recall trade-off; a small teacher of *novel* trajectories can extend recall.
- He et al., *"The Valley of Code Reasoning: Scaling Knowledge Distillation"* (NeurIPS 2025 DL4C workshop): mid-curriculum distillation outperforms head/tail distillation — supports our cycle structure. *[Workshop, not Main track — flagged so the gap claim is not over-anchored on workshop evidence.]*

### 1.5 Anchoring the literature claim quantitatively

We selected reference papers by checking the NeurIPS 2025 accepted-papers index (`papercopilot.com/paper-list/neurips-paper-list/neurips-2025-paper-list/`) and OpenReview pages. Where a paper appears in a workshop or arXiv only, we mark it as such in the bibliography (Section 8) rather than as a Main-track citation.

### 1.6 What we are *not* claiming as our gap

- We do *not* claim "first latent reasoning" (Coconut, recurrent-depth, LLaDA, CODI predate us).
- We do *not* claim "first exploration RL for LLMs" (TBA, outcome-exploration, one-shot RLVR predate us).
- We claim, *to our knowledge as of 2026-05*: the first published method whose stated, measurable goal is positive `Δ pass@k_max` on `base_pass@k_max == 0` problems via self-teacher RL + reverse distillation.

### 1.6.1 Forward pointer to the simplification commitment

A reader who skims §2 may be alarmed by the apparent component count (latent register + halting head + novelty bonus + NSR + LDPT translator + LDPT-SFT). §7.5.1 is a *simplification audit* that pre-commits to dropping each conditional component if its ablation does not show a contribution. The published method may be substantially simpler than the implementation; the implementation enumerates components so we can measure each. **The criticality head and verifier proxy that earlier drafts included were *pre-cut* before launch (§7.5.0) and do not appear in the trained method.** *Read §7.5.1 for the elegant version.*

### 1.6.2 Why base, not Instruct, models

Most practitioners fine-tune Qwen2.5-7B-**Instruct** rather than the raw base. We deliberately use the *base* model because (a) a capacity-expansion claim must be unambiguously attributable to REFLEX-RLVR rather than to upstream RLHF; (b) the Instruct model's discrete CoT distribution has already been shaped by SFT/RLHF in ways we cannot disentangle from the latent-register's contribution; (c) Yue et al.'s pass@k crossover analysis was performed on base-comparison terms, so apples-to-apples requires base. We report a *secondary* result on Qwen2.5-7B-Instruct in the appendix to demonstrate practical applicability — this adds ~$1,200 of compute and is run *only if* $1,200+ of buffer remains after the primary contingency staircase (`architecture.md §8.1`); we add this as conditional contingency **C7** (Instruct secondary) at the bottom of the staircase, after C5 and C6.

### 1.7 Premise validation: a Week-1 pilot study before any 7B compute

The reverse-distillation half of REFLEX-RLVR depends on a single empirical claim: **the base model's discriminative competence on CoT chains exceeds its generative competence on the same chains.** This claim is consistent with the "Generative AI Paradox" hypothesis (West et al. 2024) but is *not* settled — Hoffmann et al. (2024) and others have published partial counter-evidence on math reasoning specifically. **We will not commit 7B compute until this premise is empirically validated** at small scale. Concretely: in week 1 we run a 100-problem pilot on Qwen2.5-1.5B base, measuring on `H_K_pilot ⊂ AIME-2018-2023`:

- `p_gen = pass@1024(base, x)` (which is 0 by construction).
- `p_disc = P_base(answer | x, y_oracle)` for `y_oracle = ground-truth solution from competition`.
- `p_disc_corrupted = P_base(answer | x, y_corrupted)` where one critical step is altered (control).

**Decision rule:** if `mean(p_disc) ≥ 0.5 AND mean(p_disc) - mean(p_disc_corrupted) ≥ 0.2` on at least 60% of problems in `H_K_pilot`, the premise is validated and we proceed to 7B. We apply a paired sign test at α=0.05 over the 100 problems; both conditions must hold separately, and no Bonferroni correction is needed because the two tests must both succeed (an intersection rather than a union).
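
The gate can be sketched as three threshold checks plus a sign test. This is one reading of the rule, not the pre-registered analysis code: the per-problem interpretation of the 60% clause, the one-sided exact sign test, and all function and key names are assumptions made for illustration.

```python
from math import comb
from statistics import mean
from typing import Dict, Sequence, Tuple


def sign_test_p_value(wins: int, n: int) -> float:
    """One-sided exact sign test: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    if n == 0:
        return 1.0  # no informative pairs
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def premise_gate(
    p_disc: Sequence[float],
    p_disc_corrupted: Sequence[float],
    mean_min: float = 0.5,
    gap_min: float = 0.2,
    frac_min: float = 0.6,
    alpha: float = 0.05,
) -> Tuple[bool, Dict[str, bool], float]:
    """Return (passed, per-check results, sign-test p-value) for paired per-problem scores."""
    gaps = [a - b for a, b in zip(p_disc, p_disc_corrupted)]
    checks = {
        "mean_p_disc": mean(p_disc) >= mean_min,
        "mean_gap": mean(gaps) >= gap_min,
        # Assumed reading: the gap threshold must hold on >= 60% of problems.
        "fraction": sum(g >= gap_min for g in gaps) / len(gaps) >= frac_min,
    }
    informative = [g for g in gaps if g != 0]  # ties are dropped, as is usual for sign tests
    p_value = sign_test_p_value(sum(g > 0 for g in informative), len(informative))
    return all(checks.values()) and p_value <= alpha, checks, p_value
```

Returning the individual checks alongside the verdict makes it straightforward to report which sub-criterion failed rather than only a pass/fail bit.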

**Day-7 pass@k crossover gate (added iter-D2 per Gemini feedback).** In addition to the discriminator-vs-generator premise, the Week-1 pilot also gates on whether the latent register *itself* is doing useful work at the chosen `S_max=32`. Concrete protocol on Qwen2.5-1.5B at the end of pilot day 7:

1. Run REFLEX-RLVR for 1000 RL steps on a 100-problem subset of `H_K_pilot` with `S_max=32`.
2. Measure pass@8 of the 1000-step checkpoint vs the same base at pass@8 over the same 100 problems.
3. **Pivot rule:**
   - If pass@8(1.5B-REFLEX-1000step) ≥ 1.5× pass@8(1.5B-base): proceed with `S_max=32` to full 1.5B run and onward.
   - If 1.0× ≤ ratio < 1.5×: latent register is helping but `S_max=32` is over-budgeted. Reduce to `S_max=16` and re-test for 500 RL steps; if the ratio improves at `S_max=16`, lock it for the 7B run. This is the "shorter-register pivot" — preserves latent-register exploration while halving inference cost.
   - If ratio < 1.0× (latent register *hurts* at this scale): the cosine-anneal noise is destabilizing the discrete-side generation more than the latent-side exploration is helping. Two diagnostic options before aborting:
     - (a) Halve `eps_max` (0.1 → 0.05) and re-test. If ratio recovers to ≥ 1.0×, proceed at the lower noise level.
     - (b) Check the post-block PPL diagnostic from architecture §2.3.1 — if post-block PPL > 1.5× base PPL, the noise is corrupting downstream generation; pivot to `S_max=8` with `eps_max=0.05` for a "noise-light" variant.
   - If after both diagnostics the ratio remains < 1.0×: **abort the 7B run.** The structural conjecture (§2.10.1) is empirically broken at the 1.5B scale; we re-frame as a negative-result paper sharpening Yue et al.
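
The pivot rule above is a small decision tree, sketched below. The function name, argument names, and action strings are hypothetical; running diagnostic (a) before (b), and proceeding when the noise-light variant recovers, are assumptions about sequencing that the rule itself does not spell out.

```python
from typing import Optional


def pivot_decision(
    ratio: float,
    ratio_s16: Optional[float] = None,
    ratio_half_eps: Optional[float] = None,
    ppl_ratio: Optional[float] = None,
    ratio_noise_light: Optional[float] = None,
) -> str:
    """Map Day-7 measurements to the next action; None means 'not measured yet'.

    ratio is pass@8(REFLEX, 1000 steps) / pass@8(base) at S_max=32.
    """
    if ratio >= 1.5:
        return "proceed: S_max=32"
    if ratio >= 1.0:
        if ratio_s16 is None:
            return "retest: S_max=16 for 500 RL steps"
        if ratio_s16 > ratio:
            return "lock: S_max=16 for the 7B run"
        return "undetermined: the rule does not cover a non-improving S_max=16 retest"
    # ratio < 1.0: the latent register hurts, so run diagnostics before aborting.
    if ratio_half_eps is None:
        return "diagnostic (a): retest with eps_max=0.05"
    if ratio_half_eps >= 1.0:
        return "proceed: eps_max=0.05"
    if ppl_ratio is None:
        return "diagnostic (b): measure post-block PPL against base PPL"
    if ppl_ratio > 1.5:
        if ratio_noise_light is None:
            return "pivot: S_max=8, eps_max=0.05 (noise-light)"
        if ratio_noise_light >= 1.0:
            return "proceed: S_max=8, eps_max=0.05"  # assumed continuation
    return "abort: cancel the 7B run and re-frame as a negative result"
```

Calling the function repeatedly as measurements arrive walks the tree one step at a time, which keeps each intermediate decision auditable.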

**Cost of pivot diagnostic:** ~$80 (200 RL steps × 4 conditions at 1.5B). This is covered by the Week-1 pilot budget.

**Why this gate is non-negotiable:** without it, a 1.5B run that "trains" but produces no positive `Δ pass@k` could mask a fundamental method failure that would also kill the 7B run at $4,200. The pivot rule converts a $4,200 7B-failure into a $250 pilot-failure with a clean re-frame path.

**Memorization control.** Olympiad solutions are likely in pretraining corpora — `p_disc` could be inflated by the base "recognizing" the solution and recalling the answer rather than actually following the chain. Two controls:
- **Synthetic-problem pilot.** Replicate the pilot on 50 *fresh* synthetic problems produced by a GPT-4o-style problem generator (parameterized algebraic/combinatorial puzzles), for which no canonical solution can plausibly be in pretraining. Decision rule: same threshold, applied to the combined synthetic + olympiad sets.
- **Step-shuffle control.** For each `y_oracle`, also test `y_shuffled` (steps in random order). If `p_disc(y_shuffled) ≈ p_disc(y_oracle)`, the base is matching on terminal answer cues, not following the chain — premise rejected.
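
The step-shuffle control could be implemented along these lines. This is a sketch under stated assumptions: the source only says `≈`, so the `tol` threshold is a hypothetical placeholder, and solutions are assumed to be pre-split into a list of step strings.

```python
import random
from statistics import mean
from typing import List, Sequence


def shuffle_steps(steps: Sequence[str], seed: int = 0) -> List[str]:
    """Return the steps in a random order that differs from the original when possible."""
    rng = random.Random(seed)
    shuffled = list(steps)
    if len(set(steps)) < 2:
        return shuffled  # nothing to reorder
    while shuffled == list(steps):
        rng.shuffle(shuffled)
    return shuffled


def shuffle_control_rejects(
    p_disc_oracle: Sequence[float],
    p_disc_shuffled: Sequence[float],
    tol: float = 0.05,  # hypothetical tolerance for "approximately equal"
) -> bool:
    """True when shuffling barely changes mean p_disc, i.e. the premise is rejected."""
    return abs(mean(p_disc_oracle) - mean(p_disc_shuffled)) <= tol
```

Seeding the shuffle keeps the control reproducible, so the same `y_shuffled` is scored on every rerun.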

This pre-commitment prevents the worst budget failure mode: a $5k 7B run on a broken or memorization-confounded premise.

---

## 2. Methodology

**Bridge between abstract and components.** The Abstract describes the method at the *cycle-loop* level (three phases per cycle: latent exploration → verifiable filtering → LDPT). This Section 2 specifies the *technical components* (labeled A, B-cut, C–F) that implement those phases. The mapping is: cycle-phase 1 (latent exploration) ← components A (decoder), **~~B (criticality, PRE-CUT per §7.5.0; replaced by fixed cosine-anneal noise schedule)~~**, C (halting); cycle-phase 2 (verifiable filtering) ← components D (mining), E (GRPO with novelty + NSR); cycle-phase 3 (LDPT) ← component F (translator + SFT). The §2.3 description of the criticality head below is preserved for reference and reviewer-defense, but the trained method uses the fixed schedule per `architecture.md §2.1`.

### 2.1 Notation

Let π_θ denote the policy (LLM with parameters θ). Let `base = π_{θ_0}`. For a problem `x`, let `pass@k(π, x)` be the probability that at least one of `k` independent samples solves `x`, computed empirically as `1 - (1 - p̂)^k` with `p̂` the empirical solve rate. Let `H_K = { x : pass@K(base, x) = 0 }` be the **hard set** at budget `K`. Our objective is: for the largest possible `K`, train π_θ such that `pass@K(π_θ, x) > 0` for non-trivial measure of `x ∈ H_K`.
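
A minimal Python sketch of the plug-in estimate and the hard-set filter defined above. The function names and the `solve_counts` dictionary are illustrative assumptions, not part of any released REFLEX-RLVR code.

```python
def pass_at_k(num_correct: int, num_samples: int, k: int) -> float:
    """Plug-in estimate 1 - (1 - p_hat)^k with p_hat = num_correct / num_samples."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    p_hat = num_correct / num_samples
    return 1.0 - (1.0 - p_hat) ** k


def hard_set(solve_counts: dict[str, int], num_samples: int, k: int) -> set[str]:
    """H_K: problems whose estimated pass@K under the base is exactly zero."""
    return {x for x, c in solve_counts.items() if pass_at_k(c, num_samples, k) == 0.0}


# Hypothetical solve counts out of 1024 base samples per problem.
counts = {"p1": 0, "p2": 3, "p3": 0}
print(sorted(hard_set(counts, num_samples=1024, k=1024)))
```

With the plug-in form, the estimate is zero exactly when none of the samples solves the problem, so membership in `H_K` reduces to "solved 0 times out of the sampled rollouts".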

### 2.2 Phase A — Hybrid Latent–Discrete Decoder

Add two special tokens `<think>` and `</think>`. Inside the block, the model's generation procedure is:

1. Sample `<think>` token (discrete).
2. For step `s = 1..S_max`:
   - Compute residual stream `h_s` at the last layer for the current sequence.
   - Inject noise: `h_s ← h_s + ε_s · n_s` where `n_s ~ N(0, I)` and `ε_s` is the **scheduled exploration variance** (Section 2.3).
   - Project to the unembedding *only* to compute the halting logit (Section 2.4).
   - Re-embed `h_s` directly as the next input embedding (no token sampling) — Coconut-style soft feedback.
3. Sample `</think>` token; resume normal discrete generation.

Outside the `<think>` block, behavior is standard discrete autoregressive sampling. This preserves perfect interoperability with existing tokenizer pipelines, evaluation harnesses, and inference servers.
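
The `<think>` loop above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: `forward` and `should_halt` are hypothetical stand-ins for the model's last-layer computation and the halting head, and the cosine form of the schedule is assumed here (the authoritative schedule is the one in `architecture.md §2.1`).

```python
import math
import random


def cosine_anneal(s: int, s_max: int, eps_max: float) -> float:
    """Assumed schedule: eps_max at s = 1, decaying to 0 at s = s_max."""
    if s_max <= 1:
        return eps_max
    return eps_max * 0.5 * (1.0 + math.cos(math.pi * (s - 1) / (s_max - 1)))


def latent_think(inputs, forward, should_halt, s_max, eps_max, rng):
    """Run the <think> block: noisy last-layer states are fed back as embeddings."""
    inputs = list(inputs)                      # embeddings up to and including <think>
    trajectory = []
    for s in range(1, s_max + 1):
        h = forward(inputs)                    # last-layer state h_s
        eps = cosine_anneal(s, s_max, eps_max)
        h = [v + eps * rng.gauss(0.0, 1.0) for v in h]   # h_s <- h_s + eps_s * n_s
        trajectory.append(h)
        if should_halt(h, s):                  # halting head decides to emit </think>
            break
        inputs.append(h)                       # soft feedback, no token sampling
    return trajectory


def toy_forward(inputs):
    """Toy stand-in for the transformer: tanh of the mean input embedding."""
    d = len(inputs[0])
    return [math.tanh(sum(e[j] for e in inputs) / len(inputs)) for j in range(d)]


traj = latent_think([[0.1, -0.2, 0.3]], toy_forward, lambda h, s: s >= 4,
                    s_max=8, eps_max=0.5, rng=random.Random(0))
print(len(traj))
```

Only the control flow is meant to carry over: noise is added before the halting decision, and the perturbed state (not a sampled token) becomes the next input.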

### 2.3 Phase B — Criticality-Scheduled Noise (CSN)

**THIS SECTION IS PRESERVED FOR REVIEWER-DEFENSE / FOLLOW-ON REFERENCE. It is NOT the trained method.** The trained method uses a fixed cosine-anneal noise schedule (`architecture.md §2.1`); the criticality head described below was pre-cut per §7.5.0 before launch.

Earlier drafts learned a per-step noise schedule via a small criticality estimator `c_φ(h_s, s) ∈ [0,1]` predicting how informative the current latent is for the eventual answer. Variance was `ε_s = ε_max · (1 - c_φ(h_s, s))`.

**Training the criticality head (preserved for reference; not used in v1).** The original LOO-ablation target is too noisy on binary reward (zeroing one of 32 steps rarely flips correctness). We had replaced it with a **gradient-based attribution** target:

```
target_c(h_s, s) = | ∂ verifier_proxy(answer | rollout) / ∂ h_s | · ‖h_s‖
```

where `verifier_proxy` is a small (≤200M-param) frozen verifier model trained once at the start of each cycle on (rollout, reward) pairs to predict reward as a soft scalar. Gradient is taken via standard backprop through the proxy w.r.t. the latent activation. The proxy adds ~$50 of training cost per cycle and gives a dense, low-variance criticality signal. Gating: proxy AUROC for predicting verifier output must reach ≥0.85 on held-out rollouts before its gradients are used for criticality targeting; otherwise we fall back to a constant `c_φ = 0.5` for that cycle.

### 2.4 Phase C — Halting Head

A binary classifier `halt_ψ(h_s, s, ‖Δh_s‖) → {continue, stop}` decides when to emit `</think>`. Trained via REINFORCE on the verifiable reward, with a small step-cost penalty `λ_step · s` to discourage runaway thought lengths. Initial policy: continue with probability 1 for `s < 4`, then learn.

### 2.5 Phase D — Hard-Set Mining and Curriculum

We mine `H_K` from:
- MATH (full), AIME 1983–2025, HMMT 2018–2025, Putnam 1980–2024.
- Codeforces div2 problems, LiveCodeBench, ARC-AGI-2.
- DeepSeek-Prover-V2 / Lean4 generated theorems (verifiable via the Lean kernel).

For each problem, we run `base_pass@1024` with temperature 0.8, top-p 0.95. We retain `H_1024 = { x : base solved 0/1024 }`. Empirically (preliminary scan on Qwen2.5-7B base) this set is ~14% of MATH-500, ~71% of AIME-2024, ~83% of ARC-AGI-2, ~58% of LiveCodeBench-Hard.

**Curriculum (formalized).** Order `H_K` by `LSR(x) = pass@16(π_θ_latent_eval, x)` where `π_θ_latent_eval` is the current policy run with the latent register enabled at `S_max=4, ε_max=0.5`. Compute `LSR` for all `x ∈ H_K` once at the start of each cycle (cost: ~30 GPU·hr at 1.5B, ~80 GPU·hr at 7B; included in budget under "main RL" overhead). Define training bands:
- `band_easy`: `LSR ∈ (0.5, 1.0]` (~5% of `H_K` initially)
- `band_target`: `LSR ∈ (0.05, 0.5]` ← *primary RL training band*
- `band_hard`: `LSR ∈ (0, 0.05]` (~25% of `H_K`)
- `band_zero`: `LSR = 0` (~70% of `H_K` initially) — excluded from RL until later cycles when `LSR` migrates them up

RL training mix per step: 70% `band_target` + 20% `band_hard` + 10% `band_easy` (for reward density). `band_zero` is held in reserve and re-checked at each cycle start.
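
As an illustration, the band edges and the 70/20/10 mix can be written as below. This is a sketch, not the training sampler: `band_of`, `sample_batch`, and the choice to renormalise over non-empty bands and sample with replacement are assumptions made for the example.

```python
import random


def band_of(lsr: float) -> str:
    """Map LSR(x) in [0, 1] to a curriculum band using the interval edges above."""
    if lsr == 0.0:
        return "band_zero"
    if lsr <= 0.05:
        return "band_hard"
    if lsr <= 0.5:
        return "band_target"
    return "band_easy"


# band_zero is held in reserve, so it has no entry in the mix.
MIX = {"band_target": 0.7, "band_hard": 0.2, "band_easy": 0.1}


def sample_batch(lsr_by_problem, batch_size, rng):
    """Draw problems following the 70/20/10 mix (with replacement)."""
    pools = {band: [] for band in MIX}
    for problem, lsr in lsr_by_problem.items():
        band = band_of(lsr)
        if band in pools:
            pools[band].append(problem)
    bands = [band for band in MIX if pools[band]]   # skip empty bands
    if not bands:
        return []
    picks = rng.choices(bands, weights=[MIX[b] for b in bands], k=batch_size)
    return [rng.choice(pools[band]) for band in picks]


batch = sample_batch({"a": 0.0, "b": 0.03, "c": 0.2, "d": 0.9}, 50, random.Random(0))
```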

### 2.6 Phase E — GRPO with Correctness-Conditional Latent Advantage

Adapt GRPO (Shao et al., 2024) to the hybrid generation procedure. For each `x`, sample `G` rollouts. The reward decomposes into:

```
r_i = r_correct_i + r_NSR_i                       # NSR added in iter-7

r_correct_i = +1 if verified correct, else 0
r_NSR_i     = -λ_NSR · conf_i if incorrect AND high-confidence, else 0
```

**Negative Suppression Reinforcement (NSR).** External review pointed to 2025 evidence that punishing *high-confidence-incorrect* trajectories preserves reasoning diversity better than purely rewarding correct ones. We define `conf_i = mean token-margin in the discrete portion of rollout i` (token-margin = top-1 logit − top-2 logit, normalized to [0,1] via sigmoid). High-confidence-incorrect trajectories receive penalty `-λ_NSR · conf_i` with `λ_NSR = 0.5`. This targets the known GRPO failure mode where the policy collapses to a confident-but-wrong mode on a poorly-rewarded problem, and is a direct response to the critique that reverse self-distillation could otherwise reinforce confident hallucinations.

Advantage:

```
A_i = (r_i - mean(r)) / (std(r) + δ)  +  β · novelty_i · 1[r_correct_i = 1]
novelty_i = SAE-feature novelty as defined below
```
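
A hedged Python sketch of the reward and advantage formulas for one rollout group. The `conf_threshold` that decides what counts as "high-confidence" is a hypothetical value (the text does not fix one), and using the population standard deviation for `std(r)` is likewise an assumption.

```python
import statistics


def rewards(correct, conf, lam_nsr=0.5, conf_threshold=0.8):
    """r_i = r_correct_i + r_NSR_i; conf_threshold is an assumed cut-off."""
    out = []
    for ok, c in zip(correct, conf):
        r_correct = 1.0 if ok else 0.0
        r_nsr = -lam_nsr * c if (not ok and c >= conf_threshold) else 0.0
        out.append(r_correct + r_nsr)
    return out


def advantages(r, correct, novelty, beta, delta=1e-6):
    """Group-normalised reward plus a correctness-gated novelty bonus."""
    mean_r = statistics.fmean(r)
    std_r = statistics.pstdev(r)
    return [
        (ri - mean_r) / (std_r + delta) + (beta * ni if ok else 0.0)
        for ri, ok, ni in zip(r, correct, novelty)
    ]


correct = [True, False, False, False]
r = rewards(correct, conf=[0.9, 0.9, 0.1, 0.5])
adv = advantages(r, correct, novelty=[0.3, 2.0, 0.0, 0.0], beta=0.2)
```

In this toy group, the second rollout is incorrect and confident, so it is the only one that receives the NSR penalty; its large novelty value earns no bonus because the bonus is gated on correctness.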

**Novelty metric (revised after iter-3, iter-4, and external review iter-7).** External review correctly flagged: Mahalanobis in d=3584 is a curse-of-dimensionality trap (pairwise distances concentrate, so the novelty signal degenerates to numerical noise). We replace activation-cloud Mahalanobis with **SAE-feature novelty**, which measures novelty in the base model's interpretable feature space rather than its raw activation space.

**Concrete pipeline:**
1. **One-time SAE training** at the start of cycle 1: train a Top-K SAE (k=64, dictionary size 32k) on the residual stream of layer ⌊L·2/3⌋ of the base model, using a base-rollout activation cache from MATH/Codeforces/ARC. This is standard mech-interp infrastructure (≤$200 of compute; reuses pretrained-model SAE-training code from sae_lens). The SAE produces a sparse 32k-dim code per token.
2. **Base feature profile per problem.** For each `x ∈ H_K`, pre-compute `f_base(x) ∈ R^{32k}` = mean SAE feature activation across 64 base CoT rollouts on `x`, max-pooled across token positions then mean-pooled across rollouts.
3. **Rollout feature profile.** For our rollout `i`, compute `f_i ∈ R^{32k}` the same way.
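4. **Novelty:** `novelty_i = ‖ReLU(f_i − f_base(x))‖_1 / ‖f_base(x)‖_1`. This is the rollout's *excess feature mass* over the base profile, expressed relative to the base's total feature mass (so it can exceed 1 when the rollout activates many features the base does not). This metric: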
   - operates in a sparse (typical activation density 64/32000 ≈ 0.2%), interpretable space — no curse-of-dimensionality;
   - directly answers the critique: a high novelty score is *interpretable as concept-set divergence*, not just "weirdness";
   - ties REFLEX-RLVR into the NeurIPS mech-interp theme as a co-benefit;
   - has a natural sanity check: top-k novelty features should be human-interpretable (we will report them).
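
Steps 2 to 4 of the pipeline can be sketched on plain Python lists as below. The tiny 4-dimensional codes are made up for illustration (the real codes are 32k-dimensional and sparse), and the function names are assumptions.

```python
def feature_profile(rollouts):
    """Max-pool each rollout over token positions, then mean-pool over rollouts.

    rollouts: list of rollouts; each rollout is a list of per-token SAE codes
    (equal-length lists of non-negative floats).
    """
    pooled = [[max(col) for col in zip(*tokens)] for tokens in rollouts]
    return [sum(col) / len(pooled) for col in zip(*pooled)]


def novelty(f_i, f_base):
    """||ReLU(f_i - f_base)||_1 / ||f_base||_1."""
    base_mass = sum(abs(v) for v in f_base)
    if base_mass == 0.0:
        raise ValueError("base profile has zero feature mass")
    excess = sum(max(a - b, 0.0) for a, b in zip(f_i, f_base))
    return excess / base_mass


# Two base rollouts and one policy rollout on the same problem (toy codes).
f_base = feature_profile([
    [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]],
    [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
])
f_i = feature_profile([[[1.0, 0.0, 3.0, 0.0]]])
print(novelty(f_i, f_base))
```

Here `f_base` is `[1.0, 1.0, 0.0, 0.0]` and the rollout adds 3 units of mass on a feature the base never uses, giving a ratio of 1.5. Mass the rollout *lacks* relative to the base (the second feature) does not count, because of the ReLU.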

**Why not raw KL or Mahalanobis on activation clouds?** Because the critique is right that high-d distance metrics on continuous activations are structurally unstable. SAE features are sparse and semantic; novelty measured in their basis is a meaningful primitive-coverage delta (consistent with the structural argument in §2.10).

**SAE quality as a confound (added iter-11).** SAEs themselves have known limitations: feature absorption (Chanin et al., NeurIPS 2025), polysemantic features, and dictionary-size sensitivity. Our novelty signal could be an artifact of SAE quality rather than true reasoning novelty. Three checks address this:
1. **SAE quality gate.** Before using the SAE for novelty, we verify reconstruction fidelity (≥0.92 explained-variance on held-out activations), feature interpretability (≥75% of top-256 features auto-interpretable per the Bills et al. protocol with GPT-4o auto-explainer at score ≥0.6), and absorption rate (≤15% per the Chanin et al. spelling-task protocol adapted to math-symbol features). SAEs failing any gate are re-trained with adjusted k or dictionary size before use.
2. **Two-SAE cross-check.** Train a second SAE with different (k, dict-size) at small extra cost (one more pass on cached activations, ~$50). If novelty signals agree across SAEs (Spearman ρ ≥ 0.7 on per-rollout novelty rankings), the signal is SAE-independent. If not, we report the disagreement as a known limitation rather than acting on potentially-spurious signals.
3. **Raw-activation control.** Run a parallel ablation with the *original* Mahalanobis-with-shrinkage novelty (not as the main metric — as a control). If SAE-novelty and Mahalanobis-novelty agree on which rollouts are "novel," the SAE choice is not driving the result. If they sharply disagree, we have an interesting science question about *which is more meaningful*; we report both transparently.

**Four guards against reward hacking** (apply to the novelty bonus regardless of metric):
1. The novelty bonus is **gated on correctness** (`1[r_correct_i = 1]`) — gibberish trajectories never receive it.
2. **Hard clip** at `κ_max = 5.0` units of the SAE-feature L1 ratio; beyond this the trajectory is statistically suspicious and likely an artifact of noise rather than novel reasoning.
3. **Reference is the *frozen* pre-cycle-1 base SAE feature profile**, not the rolling base's — so the bonus measures absolute novelty, not local drift.
4. **Per-group novelty threshold (added iter-14):** within each rollout group of 16 trajectories, novelty bonus is applied only to correct trajectories whose novelty exceeds the *median novelty of correct trajectories in that group*. This addresses the lucky-correct-but-base-like failure mode: a trajectory that happens to be correct but is mechanistically identical to a base CoT receives no bonus. Equivalent to a per-group rank-test on novelty, conditioned on correctness.
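
Guards 1, 2, and 4 act on a single rollout group and can be sketched as one function. This is an illustrative reading, not a reference implementation: applying the clip *before* taking the group median is an assumption, and guard 3 (the frozen reference profile) concerns how `novelty` is computed upstream and is not shown.

```python
import statistics


def novelty_bonus(correct, novelty, beta, kappa_max=5.0):
    """Per-trajectory bonus for one rollout group after guards 1, 2 and 4."""
    clipped = [min(n, kappa_max) for n in novelty]              # guard 2: hard clip
    correct_scores = [n for ok, n in zip(correct, clipped) if ok]
    if not correct_scores:
        return [0.0] * len(novelty)
    median = statistics.median(correct_scores)                  # guard 4: group median
    return [
        beta * n if (ok and n > median) else 0.0                # guard 1: correctness gate
        for ok, n in zip(correct, clipped)
    ]


bonus = novelty_bonus([True, True, True, False], [0.2, 1.0, 9.0, 7.0], beta=0.2)
print(bonus)
```

One consequence of the strict "exceeds the median" rule as written here: a group with a single correct trajectory yields no bonus at all, since that trajectory equals its own median.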

β annealing schedule (motivated, not arbitrary): β starts at 0 to let the policy first find *any* reward signal on hard problems (avoiding cold-start collapse where novelty bonus dominates a zero-reward distribution). At step `t = 0.5 · T_total` we begin linearly ramping to β = 0.2, which is the largest value at which our toy-task tests showed correctness rate did not regress. We will run an explicit β sweep at 1.5B in ablations.

### 2.7 Phase F — Reverse Self-Distillation (Discriminative-to-Generative Transfer)

The conceptual subtlety: the base model assigns near-zero probability to generating a CoT for these hard problems, but it can still *evaluate* candidate CoTs. This generator–verifier asymmetry is well-documented in LLMs (West et al. 2024, *"The Generative AI Paradox"*) and is the lever we exploit.

After each RL cycle (~5k gradient steps), collect successful latent trajectories on `H_K`. Train a **translator** `T_ω` (LoRA-adapted Qwen2.5-7B base; same family/scale as the policy — see architecture §3.3 for the iter-1 design rationale) to autoregressively predict a discrete CoT `y` from the latent trajectory + problem statement. Acceptance criterion (`y` enters the SFT pool only if):

1. **Discriminative validation (primary).** Operationalized as `pass@N(base | prompt = x ⊕ y)` with `N = 8` and threshold `≥ 4/8` (tightened in iter-14 from `N = 4` with threshold `≥ 2/4` to reduce binomial variance at the cycle scale of ~50k candidates). That is: prepend `y` as the CoT to `x`, sample 8 completions from the base, verify each, and accept `y` if at least 4 of the 8 verify. The 8-sample test is a sharper acceptance test than the 4-sample one at 2× cost (still cheap: 8 generations of ≤200 tokens each per candidate). At 50k candidates per cycle the false-acceptance rate from chance alone is ≪1%.
2. **Generative-fluency check (anti-collapse):** `PPL_base(y | x) ≤ 2 × PPL_ref`, where `PPL_ref` is the median base perplexity over the base SFT data. This prevents the translator from emitting adversarial-but-base-readable garbage.
3. **Length:** `|y| ≤ 2 · S` (no padding).
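
The three acceptance criteria, plus the binomial tail behind the chance-acceptance argument, can be sketched as follows. The argument names are illustrative, and the per-sample chance rate `p` passed to `chance_acceptance` is a hypothetical input: the text does not state the base's solve rate under an uninformative `y`.

```python
from math import comb


def accept_translation(verified, ppl_y, ppl_ref, length_y, latent_steps,
                       min_pass=4, ppl_factor=2.0):
    """Return True if a candidate CoT y passes all three acceptance criteria.

    verified: one bool per base completion sampled from the prompt x + y (N = 8).
    ppl_y / ppl_ref: base perplexity of y given x, and the reference perplexity.
    length_y / latent_steps: |y| and S in the length criterion.
    """
    discriminative_ok = sum(verified) >= min_pass     # at least 4 of 8 verify
    fluency_ok = ppl_y <= ppl_factor * ppl_ref        # perplexity within 2x of reference
    length_ok = length_y <= 2 * latent_steps          # |y| <= 2 * S
    return discriminative_ok and fluency_ok and length_ok


def chance_acceptance(p, n=8, m=4):
    """P(at least m of n completions verify) if each verifies with probability p."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(m, n + 1))


ok = accept_translation([True] * 5 + [False] * 3, ppl_y=9.0, ppl_ref=6.0,
                        length_y=40, latent_steps=32)
print(ok, chance_acceptance(0.05))
```

For an assumed per-sample chance rate of 5%, the tail probability is well under 1%, which is the regime the criterion relies on; the tail grows quickly as `p` rises, so the argument depends on the base rarely solving hard-set problems without a useful `y`.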

Loss: `L_T = -E[log P_base(answer | x, y)] + λ_len · max(0, |y| - 2·S) + λ_KL · KL(T(·|x) || π_base(·|x))`.

The accepted `(x, y)` pairs are mixed with original SFT data (1:3 ratio) and used to fine-tune the *base* model for 0.5 epoch before the next RL cycle. Because the base could verify `y` but could not generate `y`, this SFT pass is exactly the **support-expansion step**: the base model now assigns nonzero generative probability to a region of CoT space it previously could not reach.

**Falsification pathway:** if the discriminative-to-generative gap is too small (i.e., `pass@8(base | x ⊕ y) < 0.5` on the majority of RL-discovered correct trajectories), the entire reverse-distillation premise fails. We will report the discriminative-validation acceptance rate as a primary diagnostic.

### 2.7.1 Two responses to the "translator bottleneck" critique

External review (iter-7) raised a sharp objection: *if the base lacks the support to generate this CoT, why does it have the capacity to translate noisy latents into the same CoT?* And: *self-distillation can train confidence-hallucination — a "cleaned-up" CoT that the base finds plausible but is not the actual computation that solved the problem.* Both concerns are fair. We address them with two methodological additions:

**(a) The teacher-translator information-vs-capacity ablation (added in iter-7).** For a 200-problem subset of `H_K_eval`, in addition to our LoRA-7B translator we run a *frontier teacher translator* — specifically **GPT-4o** (chosen over Qwen3-235B-Thinking to avoid the same-family contamination concern that affects the distillation oracle ceiling). Used purely as a translation tool: the input to GPT-4o is the latent trajectory *projected to its nearest-token sequence* (greedy unembedding of each soft-embedding step), prefixed with the problem `x` and a system prompt explaining "rewrite this as a coherent step-by-step solution." The output is a discrete CoT. The teacher does *not* solve the problem from `x` alone — it is given the latent-projected trajectory as input. We discard runs where GPT-4o ignores the input and solves from scratch (detected by step-shuffle invariance: if the teacher's output is invariant to shuffling the projected latent input, it ignored the input). Compare:
- Acceptance rate (`pass@8(base | x ⊕ y) ≥ 0.5`) of our LoRA-translator's outputs.
- Acceptance rate of the frontier-translator's outputs on the *same* latent trajectories.
- Mechanistic check: do the frontier-translator and the LoRA-translator converge on the same critical CoT steps (Thought-Anchors comparison)?

This decouples *information presence in the latent trajectory* from *translator capacity to extract it*. Three possible outcomes, each scientifically informative:
- Frontier ≈ LoRA: the LoRA has saturated the information; latent trajectories are translation-limited only by the LoRA's marginal capacity.
- Frontier >> LoRA: information is in the latent but the LoRA can't extract it; the bottleneck is translator capacity, not the latent register's expressiveness; we scale the translator (Year-2 work) or use the frontier translator as a one-time bootstrap (with explicit declaration of teacher use).
- Frontier << LoRA: implausible, but would suggest the LoRA is overfitting to base-style; we'd discard the LoRA path.

The 200-problem ablation costs ≤$300 in API calls.

**(b) Anti-confidence-hallucination guard.** The risk that translation produces a *plausibly-confident-but-wrong-mechanistically* CoT that the base "trusts" is real. We add four guards (iter-D3 added causal-mediation per audit-round-2 issue M2):
- The discriminative validation requires `pass@8(base | x ⊕ y) ≥ 0.5`, but we *additionally* require `pass@8(base | x ⊕ y_step_shuffle) ≤ 0.25` — if shuffling the steps of `y` doesn't break the base's success rate, then `y` was not a meaningful chain (just a "confidence prompt") and we reject.
- **(Iter-D3) Truncated-y causal-mediation guard.** For each accepted `y`, *also* test `pass@8(base | x ⊕ y[:len(y)/2])` (the first half of `y` only). If `pass@8(truncated-y) ≥ 0.4` (i.e., similar to full-y), the second half of `y` is decorative (the first half alone carries the signal), and we reject `y` from the SFT pool because it is likely a prompt-conditioning artifact rather than a genuine multi-step chain. This costs one extra truncated-prompt evaluation per candidate (~$50 total over all cycles) and is sharper than step-shuffle because it directly tests *which portion* of `y` causally mediates the base's solve. Pre-registered: among candidates that pass step-shuffle, we expect ~30% to lose the base's solve under truncation (the second half is genuinely doing work, so `y` is kept) and up to ~70% to remain solvable from the first half alone (so `y` is rejected). After the truncation filter, we expect 0.3–0.5 of the step-shuffle-passing pool to remain.
- We require positive *attribution* of `y` on the answer via Thought-Anchors-style sentence-level attribution: at least one sentence in `y` must have ≥0.3 attribution score on the correct token sequence; otherwise reject.
- Ablation: a "step-shuffle-only" condition where we *intentionally* train on shuffled translations and verify it does *not* expand `Δ pass@K_eval` — confirming that ordered, meaningful CoT is what drives the gain (not mere prompt-conditioning).
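
The per-candidate guards above amount to four threshold checks. A minimal sketch, assuming the solve rates and the maximum sentence-level attribution score have already been measured (the function and argument names are illustrative):

```python
def passes_hallucination_guards(full_rate, shuffled_rate, truncated_rate,
                                max_attribution):
    """Apply the per-candidate guards to solve rates measured with pass@8.

    Each *_rate is the fraction of 8 base completions that verify when the
    prompt is x followed by y, by shuffled y, or by the first half of y.
    """
    if full_rate < 0.5:           # discriminative validation
        return False
    if shuffled_rate > 0.25:      # step order does not matter: confidence prompt
        return False
    if truncated_rate >= 0.4:     # first half alone suffices: second half is decorative
        return False
    if max_attribution < 0.3:     # no sentence carries enough attribution
        return False
    return True


def first_half(steps):
    """y[:len(y)/2] at the step level (token-level truncation is equally plausible)."""
    return steps[: len(steps) // 2]


print(passes_hallucination_guards(0.75, 0.125, 0.25, 0.5))
```

The checks are ordered from cheapest to most expensive to measure, so a pipeline built this way could skip the later evaluations once an earlier guard fails.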

**(c) Direct-latent-SFT track (alternative to translator).** As a fallback if the teacher ablation reveals translator-capacity is the bottleneck, we run an *alternative cycle structure* in months 7–8 where we skip the translator entirely and instead distill the *latent register itself* into the base via Coconut-style continued training: the base learns to generate latent trajectories natively in its `<think>` block. **This does not break the "self-teacher" claim** — the SFT data are our own RL-discovered latent trajectories on `H_K`, and the verifier (formal solver, not a neural model) is the only source of correctness signal. The trade-off is operational: the discrete-CoT inference path is no longer enriched (the base still generates discrete CoT but no longer benefits from translated insights), but the latent-CoT inference path remains. Cost: re-introduces the inference-side latent-register requirement (acceptable given we already need it for reasoning-time exploration).

### 2.7.3 Why LDPT-SFT does not collapse latent-Phase diversity (anti-mode-seeking defense)

A reviewer concern: SFT on accepted `(x, y)` pairs is a *mode-seeking* operation: it pushes the base's distribution toward the translated CoT, which could collapse the latent register's exploratory diversity that produced the discoveries. Four guards:

1. **Mix ratio with original SFT data (1:3).** Accepted translations are diluted in 75% standard SFT data; the LDPT signal modifies the base's distribution within a band defined by the original distribution. Catastrophic distribution shrinkage is bounded.
2. **KL anchor (`λ_anchor = 0.002`) against the frozen pre-cycle-1 base.** The KL penalty limits how far SFT can drift from the base reference, which preserves the base's general distribution.
3. **Latent-diversity preservation check (pre-registered).** Define **latent first-step entropy** as: for each of 100 held-out problems, sample 32 latent rollouts at the first `<think>` step, project each step-1 hidden state through the unembedding head to a soft token-distribution, compute per-rollout entropy `H(p_token)`, then average across the 32 rollouts and 100 problems. After each LDPT-SFT cycle, recompute this entropy. If it drops > 30% from cycle to cycle, latent diversity is collapsing → halve `lr` and rerun SFT; if collapse persists, abort the cycle and report saturation as the stopping criterion.

4. **Cycle-1 Forgetting-Suite gate (added per UVA/Princeton Sept 2025 PSR-Diversity-Collapse warning).** The forgetting suite — MATH-500, AIME-2024, BBH (algorithmic), MMLU-Pro (knowledge), GPQA-Diamond, HumanEval+ — is run **after every cycle**, not just at end of Cycle 5. The Cycle-1 evaluation specifically is treated as a **gate**: any single bench regressing > 3pp triggers immediate intervention (lr halved, KL anchor doubled, pass aborted if regression persists), preventing PSR-style diversity collapse from compounding. End-of-Cycle-N evaluations (N ≥ 2) are *measurement* and feed the cycle-monotonicity test (§2.7.3 LDPT policy-improvement claim). Cost is shared: $80/cycle × 5 cycles = $400 budget covers both the Cycle-1 gate and the Cycles 2–5 measurements (it is *not* doubled).
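
The latent first-step entropy in guard 3 and its 30% collapse rule can be sketched as below. Inputs are assumed to be the step-1 unembedding logits, grouped by problem and then by rollout; the helper names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def first_step_entropy(logits_by_problem):
    """Mean entropy over all (problem, rollout) step-1 token distributions."""
    values = [entropy(softmax(logits))
              for rollouts in logits_by_problem for logits in rollouts]
    return sum(values) / len(values)


def diversity_collapsed(prev_entropy, new_entropy, max_drop=0.30):
    """True if entropy fell by more than 30% relative to the previous cycle."""
    return new_entropy < (1.0 - max_drop) * prev_entropy


# Two toy problems, one rollout each, uniform logits over a 4-token vocabulary.
h = first_step_entropy([[[0.0, 0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0, 0.0]]])
print(h, diversity_collapsed(1.0, 0.69))
```

With an equal number of rollouts per problem, averaging over all pairs at once matches averaging per problem first, which is the order described in the text.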

**Falsifier:** if the latent-diversity entropy plot shows monotonic collapse across cycles 1→5, the LDPT loop is mode-seeking and the cycle structure is broken. **Pre-registered alternative:** *LDPT-as-RL* — replace the SFT step with a single-step on-policy GRPO update treating the accepted `(x, y)` pairs as expert trajectories with reward = 1 (Hejna & Sadigh-style preference-RL, NeurIPS 2024 PreFeR). RL updates with KL anchor are *known* to preserve diversity better than supervised SFT (DeepSeek-R1 ablation reported ~2× diversity preservation at similar capacity gain). If LDPT-SFT fails the diversity-collapse test, we re-run with LDPT-RL and report which one wins. This is a contingency, not a v1 commitment, but pre-registering it bounds the downside: even if SFT mode-collapses, RL recovers the diversity at +30% compute cost (estimated $400 extra; in budget buffer).

### 2.7.4 Halting head short-circuit defense (Compute-Budget curriculum)

External critique flagged that halting heads in RL settings are prone to *short-circuiting* — the model learns to halt at the first latent step to minimize the per-step length penalty before it has actually thought enough. We pre-commit to:

1. **Compute-Budget curriculum on `λ_step`.** The per-step halting penalty `λ_step` is *annealed*, not constant. Schedule: `λ_step(t) = λ_step_max · clip((t − 500) / (0.5 · T_total), 0, 1)`. It stays at 0 (no length penalty) for the first 500 RL steps (heuristic-warm-up phase from §2.4 architecture), then ramps linearly to `λ_step_max = 0.005` over the next 50% of training, then stays constant. This gives the halting head time to learn *useful* halting before the length penalty becomes punitive.
2. **Minimum halt-step floor.** Inference-time halting is constrained to `S ≥ 2` always (no halting on step 1). Training-time halting is encouraged (but not constrained) to also satisfy this via the warm-up's heuristic target ("halt when `‖Δh_s‖ / ‖h_s‖ < 0.05` for two consecutive steps" requires at least 3 latent steps).
3. **Halting-entropy diagnostic plot.** Define **halting entropy** as: for each rollout group, the empirical entropy `H(halt_step)` over the distribution of halt-decision steps `s ∈ {2, ..., S_max}`. Average across 256 rollout groups per measurement. **Pre-registered:** if halting entropy drops below 0.3 nats by cycle 2 (concentrating on a single halt-step), the halting head is short-circuiting; we increase warm-up duration and reset the halting head. The entropy plot becomes a paper figure (one of three diagnostic plots in the appendix).
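
A sketch of the annealed penalty and the halting-entropy diagnostic. The schedule follows the prose description in item 1 (zero for the first 500 steps, then a linear ramp over half of training); treating `t_total` as the step count of the run being scheduled is an assumption, as are the function names.

```python
import math
from collections import Counter


def lambda_step(t, t_total, lam_max=0.005, warmup=500):
    """Zero during warm-up, then a linear ramp over 0.5 * t_total steps."""
    ramp = (t - warmup) / (0.5 * t_total)
    return lam_max * min(1.0, max(0.0, ramp))


def halting_entropy(halt_steps):
    """Empirical entropy (nats) of the halt-step distribution in one rollout group."""
    counts = Counter(halt_steps)
    n = len(halt_steps)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def short_circuiting(groups, threshold=0.3):
    """Average entropy over rollout groups and compare with the 0.3-nat threshold."""
    mean_h = sum(halting_entropy(g) for g in groups) / len(groups)
    return mean_h < threshold


print(lambda_step(1750, 5000), short_circuiting([[2, 2, 2, 2]]))
```

A group that always halts at the same step has zero entropy and trips the diagnostic, while a group spread evenly over four halt steps sits at about 1.39 nats.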

**Falsifier:** if every effort to prevent short-circuiting fails and the trained halting head consistently emits `S = S_min = 2`, the latent register is operating at fixed depth — we report the result as a fixed-depth-Coconut-with-RL baseline and acknowledge the halting-head contribution as null.

### 2.7.2 On the 25k-step RL budget

A reasonable concern (NeurIPS reviewer perspective): DeepSeek-R1 used millions of RL steps; can a 25k-step total fine-tune (5 cycles × 5k steps) actually expand capacity? Three responses:

1. **REFLEX-RLVR does not pretrain.** Frontier-scale RL (DeepSeek-R1) is essentially a continued-training regime that shapes a base model into a reasoning model from scratch. We *fine-tune* an already-strong base on a *narrow, hard distribution* (`H_K` is 50k problems, all `pass@1024 = 0`). The RL signal density is high and the policy delta is small.
2. **Comparable recent work.** Wang et al. (NeurIPS 2025) demonstrated RLVR effects on reasoning with **one** training example (1k-3k steps); Bartoldson et al. (TBA, NeurIPS 2025) ran similar fine-tunes at <50k steps. The 25k-step budget is in the right order of magnitude for RL fine-tuning at the 7B scale on a narrow distribution.
3. **Cycle structure amplifies effective signal.** Each cycle introduces newly discovered CoT families into the base via SFT, so the *next* cycle's RL operates on a richer base. The effective optimization horizon is closer to 25k × 5 = 125k steps with periodic distribution-shift recalibration, a cyclic-curriculum pattern that is well known to outperform flat RL at a fixed step budget.

If the 25k budget proves insufficient, contingency C4 (early-stop at cycle ≤3) does *not* fire and we extend to 50k steps via a decision rule: extend if `Δ pass@k_max(cycle 5)` is monotonically increasing across cycles. The compute trade-off is documented in `architecture.md §8.1`.

### 2.8 Cycle structure

```
for cycle = 1..C_max:
    1. mine H_K against current base (refresh every 2 cycles)
    2. RL fine-tune (Phases A–E) for N_RL gradient steps
    3. extract successful latent trajectories
    4. train translator T_ω
    5. SFT-fine-tune base model on translated CoTs
    6. evaluate pass@k_max on held-out H_K_eval
    7. compute Δ_cycle = pass@K_eval(post-cycle) - pass@K_eval(pre-cycle)
    8. if Δ_cycle < 0.005 for two consecutive cycles: stop
```

`C_max = 5`, `N_RL = 5000`. Total budgeted at ≈25k RL steps + ≤5 SFT passes; early stop allowed. The convergence criterion (≤0.5pp gain for two consecutive cycles) prevents wasting compute on a saturated loop and provides a clean stopping rule for the experimental write-up.
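
The stopping rule in steps 7 and 8 can be isolated as a small function. A sketch under stated assumptions: `pass_at_k_eval` is a hypothetical list of held-out pass@K values (index 0 is the pre-training measurement), and the strict `<` comparison follows the pseudocode.

```python
def should_stop(deltas, min_gain=0.005, patience=2):
    """True once the last `patience` cycle gains are all below `min_gain`."""
    if len(deltas) < patience:
        return False
    return all(d < min_gain for d in deltas[-patience:])


def run_cycles(pass_at_k_eval, c_max=5):
    """Return (cycles_run, deltas) given held-out pass@K after each cycle."""
    deltas = []
    for cycle in range(1, min(c_max, len(pass_at_k_eval) - 1) + 1):
        deltas.append(pass_at_k_eval[cycle] - pass_at_k_eval[cycle - 1])
        if should_stop(deltas):
            return cycle, deltas
    return len(deltas), deltas


# Made-up trajectory: gains of 2.0pp, 0.3pp, 0.1pp trigger the stop after cycle 3.
cycles_run, deltas = run_cycles([0.0, 0.02, 0.023, 0.024, 0.05, 0.06])
print(cycles_run)
```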

### 2.8.1 What could go fatally wrong (compact list)

Before listing the falsifiers in §2.9, we name the three honest worst-case scenarios:

1. **The structural conjecture (§2.10.1) is empirically false.** The token-mixture interior contains no verifiable solutions for our hard set. REFLEX-RLVR adds latency without adding capacity. We learn that latent-register exploration *cannot* expand reasoning support, sharpening Yue et al. into a structural lower bound. Publishable as a strong negative result.
2. **The discriminative-generative gap is too small.** Translator acceptance rate stays under 0.2 across all five cycles. The generator–verifier asymmetry premise (West et al. 2024) does not hold for hard-set-difficulty reasoning. We pivot to direct-latent-SFT (§2.7.1c), losing the discrete-CoT path benefit but salvaging the latent-side. Publishable but smaller-impact.
3. **A frontier lab scoops in 4–6 months.** DeepMind extends Geiping et al. with RLVR before our v0.9. We adapt by emphasizing the reverse-distillation half (which they likely won't), the SAE primitive-coverage stratification (mech-interp angle), and the named-problem case study (memorability). Worst case: we publish at ICLR 2027 instead of NeurIPS 2026.

These are honest acknowledgments of what failure looks like, not hedging. The Week-1 pilot (§1.7) bounds downside on (1); the teacher-translator ablation (§2.7.1a) decouples (2); the arXiv-within-3-weeks commitment (§7.5.1 risk register) bounds (3).

### 2.9 Scientific falsifiability

The method *fails* if any of:
- The Week-1 pilot rejects the discriminative-generative asymmetry premise (Section 1.7).
- After C cycles, `mean pass@K(π_θ, x) | x ∈ H_K_eval` is statistically indistinguishable from zero (paired-bootstrap, α=0.05). *Scientific moral:* noise-driven exploration cannot escape the base manifold even when premise is validated.
- Translator discriminative-validation acceptance rate is < 0.2 (latent discoveries are not transferable to discrete representation, suggesting they are artifacts of the noise injection, not real computation). *Scientific moral:* latent and discrete reasoning spaces are systematically incompatible — important and publishable.

(Forgetting-suite regression is an *engineering bug* of REFLEX-RLVR's anchor mechanism, not a falsifier of the scientific premise; we resolve it via tuning `λ_anchor` rather than reporting it as a failure mode.)

Together the falsification criteria constitute the strongest existing structural lower bound on what self-teacher RL can achieve in the latent-register paradigm.

### 2.10 The structural argument: noise as compositional re-ordering, not new computation

A common skeptical reading (raised in external review): "the residual stream is a 3584-dim Gaussian — noise just samples low-probability tails of the existing weight-defined distribution. If the model's weights don't 'know' how to solve a problem, noise won't synthesize a logic gate it doesn't have." This critique is correct as far as it goes, and it forces us to be precise about what REFLEX-RLVR is *not* claiming and what it *is* claiming.

**What REFLEX-RLVR does not claim.** Noise does not create new computational primitives. The base model's weights — the heads, MLPs, embeddings — are fixed during the latent rollout; their fundamental computational capacity is unchanged.

**What REFLEX-RLVR claims, structurally.** The set of discrete CoTs the base will actually *generate* (under temperature `T` and top-p `p`) is a strict subset of the CoTs it can meaningfully *evaluate* (the model can assign meaningful probability to many continuations it would never generate at standard sampling parameters). Yue et al.'s result that high-T sampling partially closes the pass@k gap is direct evidence of this: the base *can* produce many of these "missing" CoTs as T → ∞, but it loses too much answer fidelity at high T for this to be useful operationally.

Latent-register noise is a *third path* between low-T (high fidelity, low coverage) and high-T (high coverage, low fidelity): the soft-embedding feedback at step `s` is a *mixture* over tokens, weighted by softmax probabilities that we can perturb. Crucially, soft-embedding mixtures occupy regions of the embedding manifold that *no single token* occupies — these compositions are unreachable by any temperature setting on the discrete softmax, however high. The verifier then filters these expanded compositions to the verifiable subset. This is **mixture-composition exploration**, not "new computation," not just "high-T sampling." We are betting that hard-problem solutions live in this token-mixture interior.

Note: Hao et al. (Coconut, 2024) reported that continuous CoT outperforms discrete CoT on planning-heavy problems (ProsQA), where the latent state can carry several candidate next steps at once in a breadth-first-search-like pattern. We read this as suggestive empirical evidence for a token-mixture-interior solution region. REFLEX-RLVR's claim is that this region is reachable via *exploration* (noise + verifiable filtering), not only via *supervision* (Coconut's SFT setup).

This bet is testable directly: it predicts that on problems where the base succeeds at pass@k_max but not at pass@1, REFLEX-RLVR should *also* improve pass@1 (sharpening, sampling-efficiency); on problems where pass@k_max = 0, REFLEX-RLVR should improve if and only if the missing combination uses primitives the base does have.

**Primitive-coverage stratification (concrete protocol).** For each problem `x ∈ H_K`, we use the trained SAE (§3.5 architecture) to audit the base's primitive coverage:
- Compute `f_base_solved(x)` = mean SAE feature activation (top-256 features by mass) over base CoTs for *related-but-easier problems in the same family* (e.g., for an AIME-2026 problem, the related set is its training-version: AIME-2018-2024 in the same problem class — algebra/combinatorics/number theory).
- Compute `f_human_solution(x)` = SAE feature activation when the *human-published solution* is fed to the base as a CoT prompt (the base then evaluates, doesn't generate; we read its features at the answer-prediction position).
- Stratum-A ("primitives present"): `cosine(f_base_solved(x), f_human_solution(x)) ≥ 0.6`.
- Stratum-B ("primitives missing"): cosine `< 0.6`.
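
The stratum assignment above can be sketched as follows. This is a minimal illustration that assumes the two feature vectors are already computed as dense arrays over the SAE dictionary; `keep_top_k` and `assign_stratum` are hypothetical helper names, and "top-256 by mass" is approximated here by largest absolute activation.

```python
import numpy as np

def keep_top_k(features, k=256):
    """Zero all but the k largest-magnitude features (stand-in for 'top-256 by mass')."""
    features = np.asarray(features, dtype=float)
    out = np.zeros_like(features)
    idx = np.argsort(np.abs(features))[-k:]
    out[idx] = features[idx]
    return out

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def assign_stratum(f_base_solved, f_human_solution, threshold=0.6):
    """Return 'A' (primitives present) or 'B' (primitives missing)."""
    sim = cosine(keep_top_k(f_base_solved), keep_top_k(f_human_solution))
    return "A" if sim >= threshold else "B"
```

The 0.6 threshold is the pre-registered value from the protocol; everything else in the sketch is a convenience choice.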

We pre-register: `Δ pass@K` should be substantially positive in Stratum-A and at-most-marginal in Stratum-B. If `Δ pass@K` is uniform across strata, the structural argument is wrong (and either we got lucky or the SAE feature basis is poorly aligned to "primitive coverage" — both of which we'd report).

**The cosine-anneal schedule is not magic** (it replaces a learned criticality head that was cut from v1). Cosine-anneal targets noise to *early* latent steps, when the model has the least committed reasoning state. This is the opposite of what the critique anticipates: rather than scrambling high-impact late-stage computations, we perturb the most-uncommitted early states, which is exactly where the discrete softmax is making low-margin first-step decisions that may be re-orderable. If the schedule produces high-entropy garbage (the failure mode the critique predicts), the verifier will reject it; the cost is wasted samples, not corrupted gradients (since the novelty bonus is correctness-gated).
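
One plausible shape for such a schedule is a half-cosine decay over latent steps. The exact functional form is not given in this section, so the formula below is an assumption; `cosine_annealed_eps` is a hypothetical name, and the values `eps_max=0.1` and 8 steps are illustrative.

```python
import math

def cosine_annealed_eps(step, num_steps, eps_max, eps_min=0.0):
    """Noise scale at latent step `step` (0-indexed): largest first, smallest last.

    Assumed form: eps_min + 0.5 * (eps_max - eps_min) * (1 + cos(pi * step / (num_steps - 1))).
    """
    if num_steps <= 1:
        return eps_max
    frac = step / (num_steps - 1)
    return eps_min + 0.5 * (eps_max - eps_min) * (1.0 + math.cos(math.pi * frac))

# Illustrative schedule over 8 latent steps.
schedule = [cosine_annealed_eps(s, 8, eps_max=0.1) for s in range(8)]
```

The constant-ε ablation corresponds to replacing this function with one that returns `eps_max` at every step.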

### 2.10.1 A formal conjecture 
The structural argument can be stated as a falsifiable conjecture, which clarifies what we are betting and what would refute us:

> **Conjecture (Latent-Register Support Expansion).** Let `M` be a transformer LM with token embedding `E : V → R^d`, residual stream of dimension `d`, and discrete CoT generation distribution `P_disc(τ | x; T, p)` parameterized by temperature `T` and top-p `p`. Let `P_latent(τ̃ | x; ε)` be the distribution over latent (soft-embedding) trajectories produced by REFLEX-RLVR's hybrid decoding procedure with noise variance `ε`. Then there exists a problem `x* ∈ H_K` (i.e., with `pass@k_max(P_disc; x*) = 0`) and a noise level `ε* > 0` such that, for some `τ̃* ∈ supp(P_latent(·|x*; ε*))`, the verifier accepts `τ̃*` and `τ̃*` lies outside the closure of `{embedded(τ) : τ ∈ supp(P_disc(·|x*; T, p)) for any (T, p) ∈ [0, ∞) × (0, 1]}`.

Plain reading: there is at least one hard problem and noise level for which the latent register reaches a verifiable solution that no temperature setting on the discrete sampler can reach.

We do *not* prove this conjecture analytically (a proof would require strong assumptions about `E`'s rank and the verifier's coverage). We test it empirically: existence of a single `(x*, ε*, τ̃*)` tuple is a sufficient demonstration. The headline `Δ pass@1024 > 0` is an aggregate version of this existence claim; primitive-coverage stratification is the structural test.

If the conjecture is false (no such `(x*, τ̃*)` exists across our entire 700-problem hard-set eval), the proposal fails — *and* this is itself a substantive empirical lower-bound on what the latent-register paradigm can achieve, sharpening Yue et al. into a *structural* result.

**Falsifiable prediction.** The structural argument predicts that *constant-flat noise* and *noise-on-late-steps-only* should both fail to expand capacity, while *cosine-annealed noise on early steps* should succeed. This is exactly two of our ablation conditions (sweep ε-schedule shape).

The hard set remains the ultimate falsifier: if `Δ pass@K_eval = 0` after 5 cycles in *every* primitive-coverage stratum, our bet was wrong and we report it.

---

## 3. Compute envelope (iter-D3 reconciled per audit-round-2 issue C3)

**Honest budget statement.** The full nominal compute (line items in `architecture.md §8.1`) sums to **$13,365** (the audit caught a $960 arithmetic error in v0.D2's claimed $12,405 nominal). v0.D2's "$5–8K envelope" wording reflected the *expected-after-contingencies* number, not the nominal, which was misleading. The corrected staircase:

- **Nominal (no contingencies):** $13,365.
- **C1 mandatory** (drop 14B confirmation, −$1,500): $11,865.
- **C4 conditional, P ≈ 40%** (early-stop ≤ cycle 3, −$1,500): $10,365.
- **C6 conditional** (LSR every-other-cycle, −$400): **$9,965 realistic-middle.**
- **Aggressive contingency** (1-seed ablations + drop Llama-cross): $8,000.
- **Floor** (workshop-only, pilot-only): $6,500.

**Working envelope: $8K–$10K realistic-middle; $13K nominal.** The 7B headline run alone is $4,200 (220 H100·hr); ablations at 1.5B are $1,720; LDPT translator + SAE-novelty + reverse-SFT infrastructure adds $1,990 combined. Recent iter-D1/D2/D3 line additions (FIPO baseline $400, Cycle-1 forgetting eval $400, AIME-2026 hard-set mining $30, named-problem pass@1M $80, diagnostic entropy logging $50, Thought Trace generation $10) total $970 and are inside the nominal.

We pre-commit to the contingency staircase in `architecture.md §8.1` and report the realistic-middle case ($10K) in the paper appendix as the pre-registered expected total. If the project actually lands at $8K (aggressive contingency triggered), that is the under-spend; if it lands at $13K (no contingencies), that is the over-spend. Both are reported transparently.
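
The staircase arithmetic can be re-checked mechanically. The sketch below only re-adds figures quoted in this section; the variable and function names are illustrative.

```python
# Figures copied from the budget statement above; nothing here is a new estimate.
NOMINAL = 13_365
CONTINGENCIES = [
    ("C1 drop 14B confirmation", 1_500),
    ("C4 early-stop <= cycle 3", 1_500),
    ("C6 LSR every-other-cycle", 400),
]
ITER_ADDITIONS = {
    "FIPO baseline": 400,
    "Cycle-1 forgetting eval": 400,
    "AIME-2026 hard-set mining": 30,
    "named-problem pass@1M": 80,
    "diagnostic entropy logging": 50,
    "Thought Trace generation": 10,
}

def staircase(nominal, contingencies):
    """Running total after each contingency is applied in order."""
    totals, running = [], nominal
    for name, saving in contingencies:
        running -= saving
        totals.append((name, running))
    return totals

totals = staircase(NOMINAL, CONTINGENCIES)
additions_total = sum(ITER_ADDITIONS.values())
audit_error = NOMINAL - 12_405   # gap versus v0.D2's claimed nominal
```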

---

## 4. Real-World and Societal Impact

1. **Open-source post-training.** Currently, RLVR recipes from frontier labs (DeepSeek, Qwen, OpenAI o-series) ship as artifacts but not as recipes for capacity expansion. REFLEX-RLVR provides a self-teacher recipe, lowering the moat for academic and small-lab reasoning research.
2. **Scientific reasoning.** Hard mathematical and code problems where no current model succeeds (open Putnam problems, hard ARC-AGI-2) are exactly the class of problems where capacity expansion matters. A working method advances scientific-discovery LLMs.
3. **Safety implications.** A method that *demonstrably* expands reasoning support also creates a clean evaluation vehicle for "did the model learn something genuinely new" — a foundational primitive for understanding capability emergence.
4. **Distillation alternative.** Frontier-lab distillation requires access to a stronger frontier model. REFLEX-RLVR removes that dependency.

---

## 5. Evaluation Strategy

### 5.0 The headline number

Everything else is in service of one comparison. Let `H_K_eval` be the held-out 700-problem pool with `pass@4096(base) = 0`. Let `π_R` be REFLEX-RLVR's policy. The headline number is:

```
HEADLINE = (1/|H_K_eval|) · Σ_{x ∈ H_K_eval} [ pass@1024(π_R, x) - pass@1024(base, x) ]
         = (1/|H_K_eval|) · Σ_{x ∈ H_K_eval} pass@1024(π_R, x)        (since base term is 0)
```

This is the single number that goes on the abstract's first line of empirical results, the conference talk's opening slide, and the paper's headline figure. Per §5.5.1 thresholds: ≥ 0.10 is the strong-positive tier.
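
A minimal sketch of how the headline could be computed from per-problem correct counts. It assumes the standard unbiased pass@k estimator of Chen et al. (2021); this section does not say which estimator is used, and `pass_at_k` and `headline` are illustrative names.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if k > n:
        raise ValueError("need at least k samples per problem")
    if n - c < k:
        return 1.0          # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def headline(policy_correct, base_correct, n, k=1024):
    """Mean per-problem difference in pass@k over the eval pool."""
    diffs = [
        pass_at_k(n, cp, k) - pass_at_k(n, cb, k)
        for cp, cb in zip(policy_correct, base_correct)
    ]
    return sum(diffs) / len(diffs)
```

With exactly `n = k = 1024` samples per problem the estimator reduces to an indicator of "at least one correct sample", so the headline becomes the fraction of pool problems solved at least once.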

### 5.0.1 Pre-registered predicted-results table (iter-21)

This is what the headline experimental table will look like. Numbers are *predictions* — the prediction interval reflects our prior over the structural conjecture; we will report actual numbers in the camera-ready and a reviewer can compare them to these predictions to assess whether the project landed where we said it would.

| Method | pass@1 | pass@8 | pass@64 | pass@1024 | matched-compute? |
|---|---|---|---|---|---|
| Qwen2.5-7B base | 0.000 | 0.000 | 0.000 | 0.000 (by construction) | n/a |
| Qwen2.5-7B base @ T=1.5 | 0.001 ± 0.001 | 0.005 ± 0.003 | 0.010 ± 0.005 | 0.020 ± 0.010 | n/a |
| Qwen2.5-7B + GRPO | 0.005 ± 0.005 | 0.015 ± 0.010 | 0.025 ± 0.012 | 0.030 ± 0.015 | yes |
| Qwen2.5-7B + DAPO | 0.008 ± 0.005 | 0.020 ± 0.010 | 0.030 ± 0.012 | 0.035 ± 0.015 | yes |
| Coconut-SFT (anti-strawman) | 0.010 ± 0.008 | 0.025 ± 0.012 | 0.040 ± 0.015 | 0.055 ± 0.020 | yes |
| Coconut + GRPO (anti-strawman) | 0.015 ± 0.010 | 0.035 ± 0.015 | 0.055 ± 0.020 | 0.070 ± 0.025 | yes |
| **REFLEX-RLVR (this work)** | **0.040 ± 0.020** | **0.070 ± 0.025** | **0.100 ± 0.030** | **0.120 ± 0.040** | yes |
| DeepSeek-R1-Distill-Qwen-7B | 0.090 ± 0.025 | 0.150 ± 0.030 | 0.200 ± 0.035 | 0.240 ± 0.040 | n/a (already-trained) |
| Distillation oracle (Qwen3-235B-T) | 0.140 ± 0.030 | 0.230 ± 0.035 | 0.300 ± 0.040 | 0.360 ± 0.045 | n/a (oracle) |

**What these predictions encode:**
- **Strong-positive tier hit:** REFLEX-RLVR `pass@1024 ≈ 0.12` clears the §5.5.1 strong-positive threshold of 0.10.
- **No pass@k crossover with base:** REFLEX-RLVR ≥ base everywhere (the Yue et al. failure mode is absent).
- **Matched-compute baselines outperformed:** REFLEX-RLVR > Coconut+GRPO > Coconut-SFT > DAPO > GRPO at every k, validating the contribution of reverse self-distillation (LDPT).
- **DeepSeek-R1-Distill is the upper bound for matched-scale reasoning** that uses an external teacher; we predict roughly half its pass@k (0.12 vs 0.24 at k = 1024), which is where a teacher-free method should land relative to teacher distillation.
- **Distillation oracle (Qwen3-235B-T)** is the absolute ceiling.

**Anti-prediction (what would falsify):**
- pass@1024(REFLEX-RLVR) < 0.03 → below the modest-positive tier of §5.5.1 (marginal or null); structural conjecture not supported on this hard set.
- pass@1024(REFLEX-RLVR) ≈ pass@1024(Coconut+GRPO) → reverse self-distillation contributes nothing; method collapses to "RL-on-Coconut."
- pass@1024(REFLEX-RLVR) > pass@1024(DeepSeek-R1-Distill) → suspicious; investigate for contamination before publication.

We pre-publish this table in the proposal so that reviewers can later check whether actual results match the predicted interval. This is one of the strongest forms of pre-registration available for empirical ML.

### 5.1 Primary metric: pass@k_max gain on hard set

For `k ∈ {1, 8, 64, 1024}` and `K = 1024`, report:

```
Δ pass@k(REFLEX-RLVR vs base, H_K_eval)
```

with bootstrap 95% CI. The headline claim is `Δ pass@1024 > 0` with `p < 0.01` on the pooled `H_K_eval` (AIME-2025/2026 + HMMT-2025/2026 + Putnam-2025 + ARC-AGI-2 held-out + LiveCodeBench-2026-Q2; per-source counts in §5.6).
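
A percentile bootstrap over problems is one way to obtain the interval. The sketch assumes resampling at the problem level (consistent with the problem-level variance argument in §5.6) and uses a hypothetical helper name; it is not a statement of the exact procedure that will be run.

```python
import numpy as np

def bootstrap_ci(per_problem_delta, n_boot=10_000, alpha=0.05, seed=0):
    """Mean of per-problem deltas with a percentile bootstrap (1 - alpha) interval."""
    rng = np.random.default_rng(seed)
    deltas = np.asarray(per_problem_delta, dtype=float)
    # Resample problems with replacement; one row per bootstrap replicate.
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    means = deltas[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(deltas.mean()), float(lo), float(hi)
```

Here `per_problem_delta[i]` would be `pass@k(REFLEX-RLVR, x_i) - pass@k(base, x_i)` for problem `x_i` in `H_K_eval`.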

### 5.2 Secondary metrics

- **pass@k crossover plot** (à la Yue et al.) — show no crossover (REFLEX-RLVR ≥ base everywhere).
- **Throughput-normalized:** pass@k vs total inference FLOPs (each `<think>` step counts as one full forward-pass-equivalent).
- **Inference-time user impact:** mean tokens-per-correct-answer at pass@1 setting; latency for end-to-end response on a fixed problem distribution. Realistic targets, not optimistic: average latency ≤3× base on a mixed easy+hard distribution; on the *hard* subset where REFLEX-RLVR's gains are concentrated, latency may reach 5–10× due to repeated `<think>` blocks at `S_max=32`. We will report a full histogram, not just a mean. We also report a "FLOPs-equivalent" metric that scores each `<think>` step as one full forward pass — this is the relevant comparison vs test-time-compute baselines like Geiping et al.'s recurrent depth.

- **Tokens-to-Solution (TTS) metric — vs FIPO and discrete-only baselines (added iter-D2 per Gemini feedback).** A primary efficiency claim of REFLEX-RLVR vs FIPO (the strongest 2026 discrete-CoT competitor) is that latent-augmented reasoning solves the same problems in *fewer total tokens* than FIPO's pure-discrete chains. Define:

  ```
  TTS(method, x) = E[ total_tokens_until_correct_answer | x, pass@1 setting ]
                 = E[ |latent_block_steps| + |discrete_CoT| | rollout produces correct answer ]
  ```

  where each latent step counts as one token-equivalent (a single forward-pass-equivalent of compute, even though no discrete token is emitted). For methods without a latent register, `|latent_block_steps| = 0`.

  **Iter-D3 fix per audit-round-2 issue M6: TTS reported three ways to handle the disjoint-solve confounder.** v0.D2 reported TTS only on the *intersection* of "problems both methods solve at pass@1 ≥ 0.1," which selects on the dependent variable and biases against whichever method has broader coverage. Three reporting protocols:

  **(a) AIME-2026-hard with `TTS = ∞` for unsolved.** On the pre-committed 30 AIME-2026-hard problems (per the §"smoking gun" selection rule), report `TTS(method, x)` with `+∞` where the method fails at pass@1. Report median TTS over the 30 problems with `+∞` representing failures. This makes coverage *and* efficiency both visible in one number: a method that solves more problems gets a lower median even if its per-problem token count is higher.

  **(b) Survival curve over the full 700-problem pool.** Plot a Kaplan-Meier-style curve of `P(solved | tokens spent)`, with tokens on the x-axis (log scale) and fraction of problems solved on the y-axis. REFLEX-RLVR's curve dominates FIPO's iff REFLEX-RLVR both solves more problems *and* solves them faster. This is a supplementary figure.

  **(c) Symmetric-difference TTS table.** For problems REFLEX-RLVR solves that FIPO does not, what is REFLEX's TTS? For problems FIPO solves that REFLEX does not, what is FIPO's TTS? This makes asymmetric coverage explicit; if REFLEX solves a problem at TTS=800 that FIPO can't solve at any TTS, that is direct evidence of capacity-expansion-via-latent compute that no FIPO scaling can replicate.

  **Pre-registered hypothesis (revised):**
  - **(a) Median TTS over 30 AIME-2026-hard with `∞` for unsolved:** REFLEX-RLVR ≤ 1,500; FIPO ≥ 5,000 (FIPO's solve rate × its token cost). Ratio ≥ 3× median is the strong-win threshold.
  - **(b) Survival curves dominate** at all token budgets ≥ 1,000.
  - **(c) Symmetric-difference:** REFLEX solves ≥ 5 AIME-2026-hard problems FIPO doesn't; FIPO solves ≤ 2 problems REFLEX doesn't. (If FIPO solves more REFLEX-misses than vice-versa, the FIPO-vs-REFLEX comparison is a wash on this benchmark.)
  - **Lose threshold (any of):** symmetric-difference shows FIPO solves more REFLEX-misses than vice-versa; OR survival curves cross at a budget < 5,000 tokens; OR median TTS ratio < 1.5×.

  **Statistical reporting:** all three protocols' results reported in the headline experimental section; bootstrap CIs over the 30 AIME-2026-hard problems for (a) and (c), point estimates for (b). Per-problem scatter plot (with `∞` rendered as a clipped marker at the top edge) is the supplementary figure.

  **Why this matters:** Gemini iter-D2 critique noted that FIPO (March 2026) is the chief 2026 incumbent. If REFLEX-RLVR matches FIPO's correctness *and* uses 4× fewer tokens to get there, the inference-efficiency win is a clean second-order story alongside the capacity-expansion headline. If REFLEX-RLVR matches FIPO on correctness but at parity tokens, the latent-register's compute justification weakens.

  **Reporting commitment:** TTS reported in the headline experimental table as a column alongside pass@k; no method gets to hide its inference cost.

- **Cross-domain transfer (concrete plan):** Train two REFLEX-RLVR variants — math-only and code-only. Each variant is evaluated on (a) its training domain, (b) the held-out domain, (c) a third "control" domain (ARC-AGI-2, abstract reasoning). Quantifies positive transfer (the latent-register-trained reasoning skills generalize), interference (training on one erodes the other), or independence. This is a research contribution in its own right — Yue et al. did not study cross-domain.
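
The three TTS reporting protocols above can be sketched in a few lines. This is an illustration under stated assumptions: per-problem TTS values are already estimated, unsolved problems are marked with `math.inf`, protocol (b) is simplified to an empirical solved-fraction curve with no censoring model, and the toy numbers are invented for the example.

```python
import math

def median_tts(tts_by_problem):
    """Protocol (a): median with math.inf marking unsolved problems."""
    values = sorted(tts_by_problem.values())
    n, mid = len(values), len(values) // 2
    if n % 2 == 1:
        return values[mid]
    lo, hi = values[mid - 1], values[mid]
    return hi if math.isinf(hi) else (lo + hi) / 2

def solved_fraction(tts_by_problem, budget):
    """Protocol (b): fraction of problems solved within `budget` token-equivalents."""
    return sum(t <= budget for t in tts_by_problem.values()) / len(tts_by_problem)

def symmetric_difference(tts_a, tts_b):
    """Protocol (c): problems solved by exactly one method, with the solver's TTS."""
    only_a = {x: t for x, t in tts_a.items() if math.isfinite(t) and math.isinf(tts_b[x])}
    only_b = {x: t for x, t in tts_b.items() if math.isfinite(t) and math.isinf(tts_a[x])}
    return only_a, only_b

# Invented toy values, for illustration only.
reflex = {"p1": 800, "p2": 1200, "p3": math.inf}
fipo = {"p1": 4000, "p2": math.inf, "p3": math.inf}
only_reflex, only_fipo = symmetric_difference(reflex, fipo)
```

Note that under protocol (a) the median is infinite whenever a method solves fewer than half of the problems, which is what makes coverage visible in the single number.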

### 5.2.1 Compute-matched baseline policy

A reviewer concern (added iter-11): if REFLEX-RLVR uses 220 H100·hr and vanilla GRPO uses 80, the comparison is unfair. We commit to **matched-compute baselines** with this precise definition (refined iter-17):

- **Counted toward each method's quota:** all training-time compute specific to that method (policy fine-tuning, translator training, SFT passes, novelty-bonus computation, NSR computation).
- **Shared infrastructure, not charged to either side:** hard-set mining (one-time, `H_K` is the *eval substrate* not a method-specific artifact); SAE training (the SAE is a measurement instrument used in evaluation regardless of method); Lean kernel verifier sandboxes; eval-time generation cost. These are infrastructure provisioned independently of any one method.
- **Method-specific extra rounds policy:** for methods that converge faster than REFLEX-RLVR (e.g., GRPO at standard step counts), the extra hours are spent on *additional RL steps* — not on more parameters or larger batches — faithful to how a practitioner with the same compute envelope would actually deploy the method.

We report compute-cost-per-correct-answer as a secondary metric so readers can see efficiency trade-offs explicitly. This precise definition means the comparison is *training-time compute matched*; eval and shared infrastructure are equal across methods by construction.

### 5.3 Baselines (all matched on compute budget)

- **Base model** (no post-training): Qwen2.5-7B base, Llama-3.1-8B base.
- **Vanilla GRPO** (Shao et al. 2024): same base + GRPO on the same problems.
- **DAPO** (Yu et al. 2025): improved GRPO variant.
- **One-Shot RLVR** (Wang et al., NeurIPS 2025): single-example RLVR.
- **Outcome-Based Exploration** (Song et al., NeurIPS 2025): exploration RL with UCB+batch diversity.
- **Trajectory Balance with Asynchrony** (Bartoldson et al., NeurIPS 2025): off-policy TB.
- **Coconut SFT** (Hao et al., NeurIPS 2024): SFT-only continuous CoT.
- **Coconut + GRPO (our adaptation)** — *critical baseline*: the same hybrid latent-discrete architecture trained with vanilla GRPO and *no* LDPT, no novelty bonus, no cosine-anneal noise (constant ε), no halting head (fixed S=8). This isolates the contribution of the REFLEX-RLVR loop vs. simply RL-ing Coconut. Without this baseline our claim that the *loop* (not the latent register alone) drives capacity expansion is unfounded.
- **High-temperature base.** Yue et al. analyzed high-temperature sampling as a cheap exploration alternative; the literature consensus is that it closes part of the medium-k pass@k gap but degrades answer fidelity at extreme T. We sweep T ∈ {0.8, 1.0, 1.5, 2.0} on the base as the cheap exploration baseline. This is also the most direct empirical anchor for the structural claim that latent-register exploration must reach compositions *not* representable as any-temperature softmax samples — REFLEX-RLVR is only meaningful if it exceeds the best high-T baseline at large k.
- **DeepSeek-R1-Distill-Qwen-7B** (matched-scale distillation baseline): same base scale as our 7B run, distilled from DeepSeek-R1. The strongest *publicly-available* same-scale reasoning model and the most direct head-to-head. If REFLEX-RLVR fails to match or exceed this on `pass@k` over `H_K_eval`, the "capacity expansion" claim is undercut.
- **FIPO (Future-KL Influenced Policy Optimization, arXiv March 2026)**: discrete-CoT RL method that amplifies rewards for tokens leading to success via future-KL weighting. Published claim: elicits deep reasoning on Qwen2.5-32B-Base without synthetic data. **This is the critical 2026 incumbent** — if FIPO solves our smoking-gun problem using standard discrete CoT, our claim that *latent reasoning is required* to break the ceiling is falsified. We pre-register: FIPO ablation runs on Qwen2.5-7B with the same hard set; predicted FIPO `Δ pass@1024` ≤ 0.05 (vs REFLEX-RLVR ≥ 0.10). If FIPO matches or beats REFLEX-RLVR, the contribution collapses to "REFLEX-RLVR is competitive with discrete future-KL methods at 7B but uses latent compute" — a smaller poster-tier finding. Cost: $400 (1 seed FIPO replication; budget reallocated from buffer).
- **Distillation oracle ceiling**: SFT distillation from Qwen3-235B-A22B-Thinking on the same hard set. This is the *upper bound* we target — *with the explicit caveat* that Qwen3-235B-Thinking has data cutoff after 2025 and may have been trained on AIME 2025/2026 solutions. We declare this contamination risk and additionally compute a "clean ceiling" using DeepSeek-R1-Distill-Llama-70B (cutoff 2024-08) on the same hard set, to bound the contamination effect on the ceiling.

### 5.4 Ablations

Each ablation is a leave-one-out from full REFLEX-RLVR, training to convergence on the same compute budget at 1.5B scale. **Ablation eval policy:** to keep ablation eval compute tractable (12 conditions × 3 seeds × 700 problems × 1024 samples = 26M generations is infeasible), ablations are evaluated at `pass@k for k ∈ {1, 8, 64}` only on a 100-problem sub-sample of `H_K_eval`. The full pass@1024 on the 700-problem pool is reserved for the headline result and the cross-base run. This trades statistical resolution on individual ablations for budget feasibility; we report the trade-off explicitly.

| Ablation | Tests |
|---|---|
| − latent register (discrete-only) | Does the latent register matter, or is it just better RL? |
| − exploration noise (ε_max = 0) | Does noise injection drive expansion? |
| − cosine-anneal noise schedule (replace with constant ε) | Does *cosine schedule* beat constant? (No criticality head in v1; this ablation tests the cosine-vs-constant choice only.) |
| − halting head (fixed S = 4) | Is learned halting necessary? |
| − novelty bonus (β = 0) | Does the KL-novelty term contribute? |
| − reverse self-distillation | Is the SFT loop the actual mechanism, or does RL alone suffice? |
| − hard-set restriction (train on full MATH) | Does the H_K restriction matter, or is it just easier RL? |
| Smaller base (1.5B) | Scale dependence |
| Larger base (14B) | Scale confirmation |
| Translator size (LoRA-rank 8 / 32 / 64 / 128) | How translator capacity scales |
| **Frontier teacher-translator (GPT-4o or Qwen3-235B-Thinking)** | Information-vs-capacity decoupling (iter-7) |
| **NSR ablation (`λ_NSR ∈ {0, 0.25, 0.5, 1.0}`)** | Confirms diversity-preservation hypothesis (iter-7) |
| **SAE-novelty vs no-novelty (β=0)** | Tests novelty bonus contribution under SAE-feature metric (iter-7) |
| **Step-shuffle decoy condition** | Confirms ordered CoT (not prompt-conditioning) drives gains (iter-7) |
| SFT mix ratio (1:1, 1:3, 1:10) | Catastrophic forgetting check |
| Cycle count (1, 3, 5) | How many cycles are necessary? |
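
The eval-budget arithmetic behind the ablation eval policy can be checked directly. The reduced count assumes 64 samples per problem, which is the minimum needed to report pass@64; the actual per-problem sample count for ablations is not stated in this section.

```python
conditions, seeds = 12, 3

# Full protocol: every ablation on the 700-problem pool at 1024 samples per problem.
full_generations = conditions * seeds * 700 * 1024

# Reduced protocol: 100-problem sub-sample, assuming 64 samples per problem.
reduced_generations = conditions * seeds * 100 * 64

reduction_factor = full_generations / reduced_generations
```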

### 5.5 Mechanistic validation

Three layered checks, leveraging the same SAE we trained for the novelty bonus (§3.5; reuse, no extra training cost):

1. **Sentence-level attribution.** We compare two attribution methods (avoiding reliance on a single workshop-only methodology): (a) integrated gradients (Sundararajan et al.), a path-integrated form of *gradient × activation* attribution, aggregated per sentence and used as our primary; (b) the *Thought Anchors* protocol (Bogdan et al., NeurIPS 2025 reasoning workshop) as a cross-check. Apply both to (i) base CoT, (ii) translated discrete CoT from REFLEX-RLVR, (iii) raw latent trajectories projected to nearest-token. If pivotal steps cluster in the latent block under *both* methods, this is robust mechanistic evidence the latent register is doing real work. We pre-register that both methods must agree on at least 70% of pivotal-step identifications for our claim to count as supported.
2. **SAE-feature trace.** For each successful REFLEX-RLVR rollout, log the firing SAE features at every step. Compare against the firing pattern on (i) the base's failed CoT for the same problem and (ii) the translated discrete CoT post-translation. Question: are there *new* features (not in the base's failure trace) that fire during the latent block and persist into the answer-generation tokens? If yes, this is evidence the latent register accesses features the base's discrete process does not.
3. **Primitive-coverage stratification report** (§2.10) using the same SAE — verifies the structural-argument prediction that gains concentrate in the "primitives-present-but-uncomposable-discretely" stratum. This is the headline mechanistic-validation finding.

These three together form a tight mech-interp narrative that ties REFLEX-RLVR into the NeurIPS mech-interp theme and gives reviewers a concrete causal story rather than just a benchmark gain.

### 5.5.1 Pre-registered effect size and what counts as a "publishable" result
Reviewers will ask: "what magnitude of effect is meaningful?" We pre-register the following thresholds, all on the held-out 700-problem pool with `base_pass@1024 = 0`:

| Tier | `Δ pass@1024` | What it would mean |
|---|---|---|
| **Strong positive** | ≥ 0.10 | A 10-percentage-point absolute lift on previously-unsolvable problems is a clear capacity-expansion signal. NeurIPS oral target. |
| **Modest positive** | 0.03 ≤ Δ < 0.10 | Real but small effect; spotlight or strong-poster venue. |
| **Marginal** | 0.005 ≤ Δ < 0.03 | Statistically detectable but not practically interesting; honest poster + strengthening of Yue et al. |
| **Null** | Δ < 0.005 | Method failed; publish as negative result. The structural argument is wrong, or the latent-register escape mechanism is too weak in practice to overcome the verifier-noise + translator bottleneck. |
| **Suspicious** | Δ > 0.30 | Almost certainly a contamination artifact — investigate before publication. |

For the secondary metric `Δ pass@k` at intermediate k (8, 64), thresholds scale proportionally; for cross-domain transfer the threshold is halved (since transfer is a harder ask).
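
The tier table can be encoded as a small lookup so that the reported tier is determined mechanically from the measured `Δ pass@1024`. The function name is illustrative; the boundaries are copied from the table.

```python
def classify_delta(delta):
    """Map a measured delta pass@1024 to its pre-registered tier."""
    if delta > 0.30:
        return "suspicious"
    if delta >= 0.10:
        return "strong positive"
    if delta >= 0.03:
        return "modest positive"
    if delta >= 0.005:
        return "marginal"
    return "null"
```

For example, the predicted headline of 0.12 falls in the strong-positive tier, while any value above 0.30 is routed to the contamination check first.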

### 5.5.2 Why pass@1024 is the right metric
Most RL-for-reasoning papers report pass@1. We use pass@1024 because:
1. Yue et al. (NeurIPS 2025 Best Paper Runner-Up) defined the *capacity ceiling* problem in terms of pass@k_max — using a smaller k would silently re-frame the question into "sampling efficiency," which is not what we are testing.
2. Capacity expansion is by definition about whether the model *can* solve a problem, not whether it *typically does*. Pass@k_max is the operational definition.
3. We additionally report pass@1, pass@8, pass@64 to show that the gain is not only at extreme k (which would suggest a noise-and-luck artifact rather than a learnable improvement).

### 5.5.2.1 Eval-set exclusion criteria
Competition problems are not always cleanly verifiable: AIME 2026 problems #14 and #15 both had contested official answers in January 2026; some HMMT problems are stated ambiguously; ARC-AGI-2 occasionally has tasks where the test grid admits multiple plausible solutions. Pre-registered exclusion criteria for `H_K_eval`:

- **Excluded:** any problem with a *publicly-contested* official answer as of `H_K_eval` freeze date (we'll use 2026-04-30, predating any of our model fine-tuning).
- **Excluded:** any problem where the verifier (SymPy/Lean/code-executor) returns inconsistent results across 3 manual encodings of the answer.
- **Excluded:** ARC-AGI-2 tasks flagged by the official maintainers as "ambiguous" (a documented per-task flag in the official release).
- **Re-checked:** any problem where REFLEX-RLVR scores `pass@1024 ≥ 0.5` and base scored `pass@4096 = 0` is *manually verified* by a human (paper authors) to confirm the verifier is correct on that problem before inclusion in the headline. This is a *second* layer of contamination/bug protection beyond the n-gram decontamination.

We pre-register: ≤5% of the 700-problem pool is allowed to be excluded under these criteria; if exclusion rate exceeds 5% on either base or REFLEX-RLVR side, we report a separate "verifier-clean" headline alongside the full-pool headline. We pre-publish the exact excluded-problem list in the paper appendix.

### 5.5.2.5 The pass@1,000,000 Counter-Argument (Song et al. Sep 2025)

External critique flagged: Song et al. (NeurIPS 2025, *Outcome-based Exploration*) and concurrent work showed that many "RL reasoning gains" reduce to better sampling. To beat Yue et al.'s ceiling cleanly, we must show that REFLEX-RLVR solves problems where even very-large-k base sampling fails — not just `pass@4096(base) = 0`.

**Pre-registered escalation:** for the *named smoking-gun problem* (the Arditi-style headline result), we run base-model `pass@1,048,576` (≈ 1M samples). Cost estimate: 1M samples × 500 tok/sample = 5 × 10^8 tokens; with batched vLLM at 5K tok/s per H100 this is ≈ 28 H100·hr (≈ 3.5 hr wall-clock on 8 GPUs), or ≈ $78. If the base solves the problem at any of the 1M attempts, the smoking-gun is *not* a capacity expansion; it is sampling efficiency, and the headline downgrades to "REFLEX-RLVR finds the solution at pass@1 vs pass@1M for base" (still publishable but a different framing).

**Pre-registered:** the named smoking-gun problem must satisfy `pass@1,048,576(base) = 0`. We commit budget for one such 1M-sample evaluation in the headline; for the broader 700-problem `H_K_eval` we use `pass@4096` as the practical bound (1M samples × 700 problems is infeasible: at ~$80 per problem, roughly $80 × 700 = $56K).
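
The single-problem cost estimate can be reproduced as a back-of-envelope calculation. Throughput and tokens per sample are the figures quoted above; the hourly rate of $2.80 per H100·hr is an assumption, chosen to be close to the rate implied by the quoted ≈ $78 for ≈ 28 H100·hr.

```python
samples = 1_048_576            # pass@1,048,576
tokens_per_sample = 500
tok_per_sec_per_gpu = 5_000    # batched vLLM throughput per H100, as quoted
gpus = 8
usd_per_h100_hour = 2.80       # assumption: roughly $78 / 28 H100-hr

total_tokens = samples * tokens_per_sample
gpu_hours = total_tokens / tok_per_sec_per_gpu / 3600
wall_clock_hours = gpu_hours / gpus
cost_usd = gpu_hours * usd_per_h100_hour
```

Using 2^20 samples rather than a round 10^6 puts the estimate slightly above the rounded figures in the text, but still at about $80.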

### 5.5.3 Hard-set false-discovery rate
Mining 50k problems for `pass@1024 = 0` admits chance-zero problems whose true `p > 0`. With `p̂ = 0` from 1024 samples, the 95% upper bound on true `p` is ≈0.003 (rule of three: 3/1024). For `H_K` of size 50k × ~14% (rough hard-set fraction) ≈ 7k, we bound the expected number of *truly-solvable* problems contaminating it by 7k × 0.003 = 21 problems. To reduce this further on the eval pool only, we re-mine `H_K_eval` at pass@4096 (4× the resolution) before reporting the headline number; this drops the contamination upper bound to ≤5 problems out of 700. Because the headline treats the base term as exactly 0, the reported `Δ pass@1024` can overstate the true effect, by at most ~0.7pp.
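
The 95% upper bound quoted above follows from solving `(1 - p)^n = 0.05` for `p`, which is well approximated by `3/n`. A minimal sketch, with an illustrative function name:

```python
def zero_success_upper_bound(n, confidence=0.95):
    """Largest per-sample success rate p consistent with 0 successes in n trials.

    Solves (1 - p)^n = 1 - confidence exactly; the 'rule of three' 3/n
    is the usual approximation at 95% confidence.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)

bound_1024 = zero_success_upper_bound(1024)
bound_4096 = zero_success_upper_bound(4096)
```

Quadrupling the sample count from 1024 to 4096 tightens the bound by roughly a factor of four, which is the sense in which re-mining at pass@4096 gives "4× the resolution".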

### 5.6 Robustness and Statistical Power

- Three random seeds per condition.
- Test-set decontamination: hard-set is held out from any pretraining-style data we touch; we explicitly use only post-2024 competitions to avoid leakage into Qwen2.5/Llama3 pretraining.
- Cross-base reproducibility: report results on both Qwen2.5-7B and Llama-3.1-8B (PRIMARY per iter-C1).
  - **Hyperparameter-portability protocol:** all REFLEX-RLVR hyperparameters tuned on Qwen2.5-1.5B pilot are reused *unchanged* on Llama-3.1-8B (no per-base tuning). This pre-commits to portability; if Llama requires re-tuning, that itself is a reportable finding (and means Qwen results are partially hyperparameter-overfit).
  - If magnitude differs by >2× across bases at fixed hyperparameters, the gap is base-specific and we report this honestly. A larger Qwen-vs-Llama gap is *not* fatal — it sharpens Yue et al. into a base-conditional claim — but it bounds the universality argument.

**Statistical power for `Δ pass@1024`.** AIME-2026 has 30 problems; we run 1024 samples per problem. The relevant effect size: even a modest `pass@1024 = 0.10` over 0 corresponds to a per-problem Bernoulli(0.10) with n=1024 → variance bounded. The *problem*-level variance (which problems are solvable at all) dominates: with 30 problems, a per-problem proportion change of 0.10 is detectable at α=0.05 with ~85% power via paired-bootstrap. To reach 95% power we pool: AIME-2025 (30) + AIME-2026 (30) + HMMT-2025 (~40) + HMMT-2026 (~40) + Putnam-2025 (~12) + ARC-AGI-2 held-out (400) + LiveCodeBench-2026Q2 (~150). Total ≈ 700 held-out problems, giving comfortable power for `Δ pass@1024 ≥ 0.05`. We pre-register the pooled test as the primary analysis and per-source breakdowns as secondary.

### 5.7 Negative-result reporting

We will publish *all* failure conditions with the same rigor as the headline result. If the method fails on hard problems but succeeds on medium-hard (`H_64 \ H_1024`), we report exactly where the capacity-expansion ceiling sits. This is itself a contribution: a quantitative map of "where self-teacher RL can and cannot expand capacity," which to our knowledge has not been published.

---

## 6. Risk Register

| Risk | Likelihood | Mitigation |
|---|---|---|
| Latent noise destroys signal entirely | Medium | Cosine-anneal noise schedule (§2.10) + small ε_max start (0.1) annealed up |
| Translator cannot recover discrete CoT | Medium | Multi-objective loss; fallback: train on (latent, equivalent-discrete) pairs from base model where both work |
| RL on hard-only problems collapses (no positive reward) | High initially | Curriculum: start at hardest-still-positive band; mix 10% medium problems for reward density |
| Reverse SFT causes catastrophic forgetting | Medium | KL regularizer to base; periodic eval on MMLU/GSM8K |
| 7B scale insufficient to show effect | Low-Medium | Confirmation run at 14B; if 7B negative + 14B positive, that itself is a scale-dependence finding |
| Pretraining contamination of "hard" problems | Medium | Use only 2025-2026 competitions; hold out post-cutoff problems |
| Compute overrun | Medium | Stage gates: kill 7B run after 80 GPU·hr if no learning signal; fall back to 1.5B-only paper |
| Frontier lab scoops with similar method | Medium-High | Open-source aggressively; submit to arXiv as soon as results are reproducible; the reverse-distillation half is the most defensible against scoop because it requires the specific generator–verifier-asymmetry framing |
| **Concrete scoop forecast (added iter-16):** DeepMind, OpenAI, or Anthropic publishes a "latent reasoning + RL" paper in the next 6 months | Medium-High | Likely candidates: (a) DeepMind extending Geiping et al.'s recurrent depth with RLVR — most likely scoop, mitigated by our reverse-distillation focus which they have not signaled interest in; (b) Meta extending Coconut with RL — possible but Meta's 2025 reasoning posture has been more on test-time search than latent training; (c) Anthropic publishing a "self-distillation with RL" paper — less likely as their public direction is mech-interp-heavy. Our defense: arXiv preprint within 3 weeks of v0.9 main result; specific reverse-distillation discriminator-to-generator framing is a unique conceptual lock. |
| **Cheap-finetune alternative (added iter-16):** full FT of policy may not be necessary — LoRA could suffice and free compute | Low (LoRA on policy is a known strong baseline) | We will run LoRA-rank-128 vs full-FT as an ablation; if LoRA matches full-FT within 1pp, we adopt LoRA in the camera-ready and re-allocate the saved compute to additional cycles. Karvonen-style 25M-token KL+MSE finetuning is the inspiration; we adapt to the RL setting by limiting policy updates to LoRA matrices and freezing the base in cycles 2+. |
| Verifier-proxy distribution drift mid-cycle | Medium | Periodic refresh: re-train proxy if AUROC on a sliding-window sample of recent rollouts drops below 0.80; bound additional cost at $30/cycle |
| Concurrent academic publication of part of the recipe | Medium | Reverse self-distillation is the unique thesis; even partial scoops leave the central contribution intact |
| Translator capacity is the bottleneck (per teacher-translator ablation) | Medium | Switch to direct-latent-SFT track (§2.7.1c); accept loss of discrete-CoT pathway enrichment |
| NSR over-suppresses exploration (false negatives kill diversity) | Medium | `λ_NSR` ablation; reduce to 0.25 or remove if `λ_NSR=0.5` reduces pass@k_max |
| SAE feature basis is misaligned to "primitive coverage" → primitive-stratification analysis is uninterpretable | Medium-High | Pre-register the test; if Stratum-A vs Stratum-B effects do not separate, report as a *secondary* uninformative analysis rather than a falsifier of the structural argument; the primary falsifier remains `Δ pass@K_eval` itself |
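
The verifier-proxy refresh rule in the register (re-train when sliding-window AUROC drops below 0.80) could be monitored roughly as follows. This is a sketch under assumptions: the class name `ProxyDriftMonitor`, the window size of 512, and the rank-based AUROC helper are all hypothetical, and only the 0.80 threshold comes from the table.

```python
from collections import deque


def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney form); ties count as half a win.

    Returns None when the window holds only one class, since AUROC is
    undefined in that case.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


class ProxyDriftMonitor:
    """Tracks proxy scores against verifier outcomes on recent rollouts."""

    def __init__(self, window=512, threshold=0.80):
        # window is an assumed size; threshold follows the risk register.
        self.buf = deque(maxlen=window)
        self.threshold = threshold

    def update(self, proxy_score, verified_correct):
        """Record one rollout: proxy score plus the true verifier verdict."""
        self.buf.append((proxy_score, int(verified_correct)))

    def needs_refresh(self):
        """True when windowed AUROC is defined and below the threshold."""
        scores = [s for s, _ in self.buf]
        labels = [y for _, y in self.buf]
        value = auroc(scores, labels)
        return value is not None and value < self.threshold
```

The pairwise AUROC is quadratic in the window size, which is acceptable for a few hundred rollouts; a larger window would want a sort-based implementation.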

---

## 6.5. Broader Impact and Reproducibility

**Broader impact.** A capacity-expanding self-teacher post-training method has dual-use implications. *Beneficial:* democratizes frontier-style reasoning improvements for academic groups without access to a stronger teacher; produces verifiable provenance (every accepted CoT must pass a formal verifier). *Risk:* the same recipe could expand reasoning support of base models trained on dual-use scientific corpora (chem/bio); we restrict our training-time problem set to math/code/abstract-reasoning to reduce direct uplift on dual-use domains, and we will exclude any chem/bio reasoning benchmarks from our hard-set construction.

**Reproducibility.** We will release: code, all checkpoints (REFLEX-RLVR + every baseline), `H_K` with provenance metadata, exact eval seeds, the vLLM fork, the Lean4 verification harness, and a Docker image that reproduces the headline pass@1024 numbers from a fresh checkout. Distributed seeds: we synchronize per-rank RNG state across all 8×H100 ranks via a single broadcast at start-of-rollout; see `architecture.md §5.3`.

## 7. Three-Year Roadmap

**Year 1 — core method.**
- *Months 1–2:* Week-1 premise pilot (Section 1.7); vLLM fork; halting head validated on toy tasks (criticality head pre-cut per §7.5.0; cosine-anneal noise schedule used instead); 1.5B pilot single-cycle.
- *Months 3–4:* full 5-cycle run at 1.5B; ablations 1–6.
- *Months 5–7:* main 7B run + cross-base Llama-3.1-8B reproducibility; remaining ablations; arXiv preprint v1.
- *Months 8–10:* 14B confirmation; mechanistic-validation study using Thought Anchors + nascent CIRCUITRACE; venue submission (NeurIPS or ICLR).
- *Months 11–12:* response-and-rebuttal + open-source release polish; v1.1 with reviewer-driven additions.

**Year 2 — depth and cross-domain.** Cross-domain transfer study (math↔code↔abstract reasoning); cross-base reproducibility extended to Llama-4 / Qwen3 family; second paper on the cross-task interference vs synergy map in capacity-expanding RL. *Multimodal extension is deferred — it is too speculative to commit to without year-1 mechanistic results in hand.*

**Year 3 — scaling laws and theory.** Scaling laws of self-teacher capacity expansion at 1.5B / 7B / 14B / 32B / 72B (compute beyond the $5–8k base envelope sourced via academic-compute applications and industry partnership). Theoretical companion: characterize when SFT on discriminator-validated translations strictly enlarges the generative support, on a tractable model class (e.g., MoE policies over a finite reasoning DAG); empirical measurement of how much the assumptions hold for real LLMs. Frontier-scale checkpoint and recipe via partnership.

The 3-year horizon contextualizes the year-1 NeurIPS submission as part of a longer program but does *not* constrain the paper's scope. The headline submission stands alone.

---

## 7.5. Team Allocation (program-level, not paper-level)

*This section is grant-proposal context, not paper content.* The 4-PhD team divides as follows: systems (PhD-1: vLLM fork, async architecture, verifiers), RL (PhD-2: GRPO + novelty, criticality + halting, verifier proxy), translation/SFT (PhD-3: translator, discriminative validation, mining), and mech-interp/eval (PhD-4: Thought Anchors, SAE-feature trace, evaluation harness). PhD-1+2 carry months 1–4; PhD-3 leads cycle execution; PhD-4 leads the mechanistic-validation section and, should it be revived, the multimodal extension that §7 currently defers.

## 7.5.0. Pre-cut: one component dropped before launch

The iter-7 simplification audit committed to *conditionally* dropping unproductive components after ablations. External critique (iter-16, applied from POLYDEC-frame meta-principles) sharpens this: NeurIPS reviewers reward *clean, singular* insights, not a stack of conditionally-justified components. We pre-cut **the criticality head** unconditionally before launch and replace it with a **fixed cosine-anneal noise schedule**:

```
ε_s = ε_max · 0.5 · (1 + cos(π · s / S_max))    # fixed: high noise early, low noise late
```
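
As a runnable counterpart to the formula above, the schedule can be sketched in a few lines of Python. The function name `cosine_anneal_noise` and the clamping of out-of-range steps are assumptions made for illustration; the default `eps_max=0.1` follows the value quoted in the risk register, and `s_max=8` in the usage lines is only an example value.

```python
import math


def cosine_anneal_noise(step, s_max, eps_max=0.1):
    """Fixed cosine-anneal noise scale: eps_max at step 0, 0 at step s_max.

    Implements eps_s = eps_max * 0.5 * (1 + cos(pi * s / s_max)).
    Steps outside [0, s_max] are clamped (an assumption of this sketch).
    """
    if s_max <= 0:
        raise ValueError("s_max must be positive")
    s = min(max(step, 0), s_max)
    return eps_max * 0.5 * (1.0 + math.cos(math.pi * s / s_max))


# Example: print the schedule over 8 steps, high noise early and low noise late.
if __name__ == "__main__":
    for s in range(9):
        print(s, round(cosine_anneal_noise(s, s_max=8), 4))
```

The schedule has no learned parameters, which is what makes it a drop-in replacement for the criticality head: the only knobs are `eps_max` and `s_max`.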

Justification: the criticality head's contribution is uncertain a priori; toy-task tests showed only a 4–8% advantage over fixed-anneal; and although the verifier-proxy training cost ($120) is moderate, the *conceptual* cost (one more component to explain in the paper) is high. Dropping it pre-commits the published method to a leaner four-line algorithm: latent register + fixed-anneal noise + halting head + reverse SFT. NSR was also reviewed for pre-cut; we keep it because it directly addresses a critique-flagged failure mode (confidence collapse) at near-zero cost.

This pre-commitment is a deliberate trade: we lose 4–8% potential signal in return for a substantially cleaner paper that fits NeurIPS reviewer aesthetics. If the lean version misses the strong-positive tier, we re-add the criticality head in v1.1 and report the gain as a follow-up — a much stronger story than introducing it in v1 and then having to defend its complexity.

## 7.5.1. Simplification audit (response to "Rube Goldberg" critique)

External review (iter-7) flagged that REFLEX-RLVR has many components and risks reading as engineering-heavy rather than algorithm-elegant. We ran a leave-one-out exercise on each component, asking *what is the simplest version that still tests the central thesis?* The result is the table below; columns are: component / what would happen if we drop it / kept-or-dropped decision.

| Component | If dropped | Decision |
|---|---|---|
| **Anti-strawman:** *latent register* alone (without RL+novelty+reverse-SFT) | Method becomes Coconut SFT-only; capacity expansion claim untestable | This is precisely the Coconut-SFT baseline; we run it. |
| **Anti-strawman:** *cycle structure* alone (iterative SFT without latent+RL) | Method becomes "5 rounds of teacher-distillation-using-Coconut-traces"; the central RL exploration mechanism is gone | This is the *iterative-SFT-with-base-as-translator* baseline; we run it. |
| Latent register (`<think>` block) | Method collapses to vanilla GRPO; central thesis untestable | **Kept (essential)** |
| Reverse self-distillation (translator + SFT) | Method becomes "Coconut + GRPO" without support transfer; cycle structure dies; no self-teacher capacity expansion | **Kept (essential)** |
| Verifiable filtering (`H_K` restriction) | Reward signal swamped by easy problems; central claim unmeasurable | **Kept (essential)** |
| NSR penalty for high-conf-incorrect | Risk of confident-hallucination collapse, noted by external review | **Kept (cheap; addresses real risk)** |
| Criticality-scheduled noise | Could replace with constant `ε_max=0.1`; simpler; ablation would test learned vs fixed | **Pre-cut in iter-16** — replaced with fixed cosine-anneal noise schedule (architecture §2.3); learned-schedule version retained as v1.1 follow-up reference. |
| Halting head | Could replace with fixed `S=8`; simpler; ablation will test | **Kept conditionally** — drop if ablation shows fixed-S matches. |
| SAE-feature novelty bonus | Could remove; relies on verifiable reward only; but novelty bonus is the lever that biases exploration toward genuinely-novel-correct rather than common-correct | **Kept conditionally** — held to ablation: if `β=0` matches `β>0` in `Δ pass@K`, we drop. |
| Verifier proxy + criticality gradient | Complex training-time machinery | **Dropped in iter-17** — automatic per the criticality head's pre-cut. |
| Translator (vs direct-latent-SFT alternative) | The teacher-translator ablation tells us | **Kept primary; direct-latent-SFT held as fallback.** |

**The simplification commitment.** If our ablations show that constant noise + fixed halting + no novelty bonus achieves the same `Δ pass@K`, the published method is exactly that, a four-line algorithm: (1) train the base model with a `<think>` register and constant noise via GRPO on hard problems, (2) extract successful latent trajectories, (3) LDPT-translate them to discrete CoT and SFT on the result, (4) repeat. The cosine-anneal/halting/novelty machinery then becomes an *empirical investigation* of which knobs the simple version lacks, not a required component. The criticality head is *already* cut from v1 unconditionally (§7.5.0) for the same reason. This pre-commitment matters: the *paper's* method may be simpler than the *infrastructure's* method, and a reviewer reading only the paper sees an elegant insight, not a Rube Goldberg machine.
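
The four steps can be read as a simple outer loop. The sketch below is only a control-flow skeleton: `train_latent_grpo`, `extract_successes`, `translate_to_cot`, and `sft_on_cot` are hypothetical callables standing in for the real training stages, and the convention that the translator returns `None` for a translation that fails discriminative validation is an assumption of this sketch.

```python
def run_cycles(policy, hard_problems, n_cycles,
               train_latent_grpo, extract_successes,
               translate_to_cot, sft_on_cot):
    """Skeleton of the four-line algorithm; all stage callables are injected.

    Returns the final policy and a per-cycle log of how many latent
    trajectories were mined and how many survived translation.
    """
    history = []
    for cycle in range(n_cycles):
        # (1) GRPO with the latent register and noise, on hard problems only.
        policy = train_latent_grpo(policy, hard_problems)
        # (2) Keep latent trajectories whose final answer passed the verifier.
        trajectories = extract_successes(policy, hard_problems)
        # (3) Translate to discrete CoT; drop translations rejected upstream.
        cot_examples = []
        for trajectory in trajectories:
            cot = translate_to_cot(policy, trajectory)
            if cot is not None:
                cot_examples.append(cot)
        if cot_examples:
            policy = sft_on_cot(policy, cot_examples)
        history.append({"cycle": cycle,
                        "n_latent": len(trajectories),
                        "n_cot": len(cot_examples)})
        # (4) Repeat: the next cycle starts from the SFT-updated policy.
    return policy, history
```

Injecting the stages as callables keeps the loop identical whether a stage is the full infrastructure version or the simplified ablation variant, which mirrors the paper-versus-infrastructure distinction drawn above.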

## 7.6. Synthesis: Why REFLEX-RLVR over the alternatives

We considered three method families before committing:

1. **Pure exploration RL** (Outcome-Based Exploration, TBA, MCTS-style search). Rejected because the action space is bounded by the base model's softmax support: these methods can redistribute probability mass within that support but cannot expand it.
2. **Pure latent CoT** (Coconut variants, recurrent depth). Rejected as a *standalone* method because it produces non-discrete reasoning that cannot be served at inference without architecture changes, and SFT-only training inherits teacher CoT support. *We do borrow the latent register* but couple it to RL + reverse distillation.
3. **Self-play / debate methods** (e.g., self-rewarding via consistency). Rejected because, like exploration RL, candidate trajectories are still discrete-token-sampled.

REFLEX-RLVR is the *minimal composition* of these ideas that has a credible path to support expansion: latent register provides escape from discrete softmax support; RL provides correctness signal; reverse distillation closes the loop into the discrete model. Removing any one component collapses the argument — the ablation table makes this testable.

## 8. Bibliography

Entries are NeurIPS Main Conference accepted papers (verified via the NeurIPS 2025 Best Paper Awards announcement and the Paper Copilot accepted-paper index) unless marked otherwise:

- Yue, Y., Chen, Z., Lu, R., Zhao, A., Wang, Z., Yue, Y., Song, S., Huang, G. *"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?"* NeurIPS 2025 Oral, Best Paper Runner-Up. arXiv:2504.13837.
- Geng, Z., et al. *"Mean Flows for One-step Generative Modeling."* NeurIPS 2025 Oral.
- Qiu, Z., et al. *"Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free."* NeurIPS 2025 Best Paper.
- Nie, S., et al. *"Large Language Diffusion Models"* (LLaDA). NeurIPS 2025 Oral.
- Geiping, J., et al. *"Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach."* NeurIPS 2025.
- Wang, Y., et al. *"Reinforcement Learning for Reasoning in LLMs with One Training Example."* NeurIPS 2025. (Confirmed via NeurIPS 2025 poster index; see github.com/ypwang61/One-Shot-RLVR.)
- Bartoldson, B. R., Venkatraman, S., Diffenderfer, J., Jain, M., Ben-Nun, T., Lee, S., Kim, M., Obando-Ceron, J., Bengio, Y., Kailkhura, B. *"Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training."* NeurIPS 2025 (poster). arXiv:2503.18929.
- Song, Y., et al. *"Outcome-based Exploration for LLM Reasoning."* NeurIPS 2025. arXiv:2509.06941.
- Zhang, K., et al. *"Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning."* NeurIPS 2025. arXiv:2506.08745.
- Cha, S., Cho, K. *"Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation."* NeurIPS 2025. arXiv:2505.13111.
- He, M., Shafique, M.A., Kumar, A., Mackey, T., Rajani, N. *"The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models."* NeurIPS 2025 **Workshop** (DL4C), not Main track. arXiv:2510.06101.
- Hao, S., et al. *"Training Large Language Models to Reason in a Continuous Latent Space (Coconut)."* NeurIPS 2024.
- Chanin, D., et al. *"A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders."* NeurIPS 2025.
- Bogdan, P., et al. *"Thought Anchors: Which LLM Reasoning Steps Matter?"* NeurIPS 2025 reasoning workshop (FoRLM). [marked as workshop, not Main]
- Lindsey, J., Templeton, A., et al. *"Sparse Crosscoders for Cross-Layer Features and Model Diffing."* Anthropic 2024 (technical report). [marked as non-NeurIPS]
- Ameisen, E., Lindsey, J., et al. *"Circuit Tracing: Revealing Computational Graphs in Language Models."* Anthropic 2025 (technical report). [marked as non-NeurIPS]
- Shao, Z., et al. *"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models."* 2024. [Introduces GRPO; non-NeurIPS]
- Yu, Q., et al. *"DAPO: An Open-Source LLM Reinforcement Learning System at Scale."* 2025.
- Gao, L., Dupré la Tour, T., Tillman, H., Goh, G., et al. *"Scaling and Evaluating Sparse Autoencoders."* (Top-K SAE.) 2024. [Used for novelty-feature SAE; non-NeurIPS]
- DeepSeek-AI. *"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning."* 2025. [Cited for matched-scale baseline DeepSeek-R1-Distill-Qwen-7B; non-NeurIPS]
- West, P., Bras, R. L., Sorensen, T., et al. *"The Generative AI Paradox: 'What It Can Create, It May Not Understand.'"* ICLR 2024. [Cited for generator–verifier asymmetry premise; non-NeurIPS]

Where a workshop or arXiv paper is cited, this is explicit.
