# Go Space Experiment Inventory

## Scope and source coverage

This updated inventory combines two evidence sources: the completed result artifacts already present in the workspace and a direct browser extraction of the 10 Perplexity experiment threads supplied by the user. The linked threads were all accessible in the browser pass after dismissing login modals. The earlier Space homepage crawl was incomplete because the generic Space Threads tab showed no visible threads, but the explicit thread URLs resolved successfully.

The source links covered are:

| Link | Short ID | Thread topic | Status in this inventory |
|---|---|---|---|
| [Q-Former architecture](https://www.perplexity.ai/search/61c39267-1363-4a56-89a7-3009feaceb8b) | `61c39267` | Q-Former adapter from KataGo trunk features to LLM tokens | Design-only architecture thread |
| [GPT-5.5 + KataGo build guide](https://www.perplexity.ai/search/7c29e5d3-bda1-4894-8662-72a53f8cade9) | `7c29e5d3` | GPT-5.5/KataGo verifiability, Mastermind-Go-style architecture, hidden-state intervention design | Design and literature-derived numerics |
| [Claim consistency PyTorch experiment](https://www.perplexity.ai/spaces/go-7Bj2MY2ATv6htB0IVU2VSw/bf412963-1168-4efa-9aae-197ff63d6ea3) | `bf412963` | Minimal synthetic claim-consistency coupling experiment and FEVER implementation plan | Completed synthetic smoke run plus FEVER code plan |
| [KataGo/code generalization thread](https://www.perplexity.ai/search/7eae3017-9f02-4a34-bd50-742df95924e6) | `7eae3017` | KataGo/FEVER result recap and code-domain generalization design | Session-derived result summary plus code design |
| [KataGo win-probability metrics](https://www.perplexity.ai/search/e75fe5d3-5fa5-484f-8a3d-20ad8a9ca0af) | `e75fe5d3` | KataGo win-probability claim-bin experiment and early synthetic critique | Completed KataGo result |
| [Go causal mask and multi-head claims](https://www.perplexity.ai/search/7a631524-50dc-4085-8ee7-4a683ccd1362) | `7a631524` | B/E/C causal mask design and Go multi-head claim results | Completed preliminary Go claim-head result plus scaling diagnosis |
| [RL reasoning LM for LoGos](https://www.perplexity.ai/search/49934aa4-b969-4956-895f-78e502ee8e35) | `49934aa4` | RL mechanism to fix LoGos sparse explanation rewards | Design-only RL thread |
| [LoGos prefix-locked bottleneck](https://www.perplexity.ai/search/6db52913-3dcd-4dac-9929-4112a9c1e28a) | `6db52913` | LoGos reward limitation and prefix-locked explanation bottleneck | Diagnostic design thread with LoGos reference result |
| [FEVER tightened run](https://www.perplexity.ai/search/eb31d5a7-705b-49b6-b096-392ed76720d7) | `eb31d5a7` | Evidence-only strict masking and matched counterfactuals on FEVER | Completed FEVER tightened run |
| [Codex pilot spec](https://www.perplexity.ai/search/a1b9115c-06de-4fb1-8eb5-2bc9468ec18b) | `a1b9115c` | Tool-augmented two-pass Go reasoning with counterfactual GRPO | Design-only pilot spec |

The inventory distinguishes completed experimental results from proposed designs. “Run” means there was at least a reported metrics table or explicit result summary. “Diagnostic/design” means it was planned, interpreted, or used to guide the paper, but not established as a completed run in the accessible artifacts. “Literature/reference numeric” means the number was cited in the thread as background rather than produced by the project.

## Executive synthesis

Across the project, the central empirical pattern is consistent: a consistency loss applied to the right hidden-state span can make prose/rationale representations strongly predictive of verifier-checked claims, but this does not automatically guarantee useful natural-language explanations. The strongest positive results are the controlled synthetic runs, the LeanCheck formal-verifier run, and the KataGo win-probability run. The code experiment is the most important negative diagnostic: it achieved strong representation-level coupling but failed to produce usable prose, showing that hidden-state coupling is not the same as semantic explanation quality.

The main paper-relevant arc is:

- Synthetic controlled validation: consistency loss creates near-perfect coupling; LM-only baselines can hit perfect generation while bypassing rationales.
- LeanCheck formal verifier: rationale-only and proof-only spans separate perfectly on counterfactuals, with clean controls.
- KataGo dense domain verifier: consistency-trained commentary spans strongly encode win-rate buckets and respond to rationale swaps.
- Code coupling failure: high coupling can coexist with unusable prose, so the diagnostic ladder must include generation quality and causal tests, not only probe accuracy.
- FEVER: useful but weaker and less venue-specific; it demonstrates that real text can preserve some evidence sensitivity but that evidence-only coupling is much harder when claim-text shortcuts exist.

## Master experiment list

| Family | Status | Main setup | Key result | Main takeaway |
|---|---:|---|---|---|
| Synthetic smoke test | Run | Toy latent states, rationale tokens, claim tokens | Consistency variants hit 100% rationale-pooled classifier accuracy; baseline 3.9% | First proof that consistency loss can couple rationale hidden states to claims |
| Synthetic scaled convergence | Run | Larger synthetic run, more epochs/data | All variants reached 100% generation; baseline stopped reading rationales | LM objective can solve generation while bypassing rationales, motivating consistency loss |
| Synthetic hidden-state intervention | Run | Patch rationale hidden states between examples | Consistency variants show 73-89% layer-0 patch effect; baseline 31% | Causal evidence that consistency-trained rationale states influence predictions |
| Synthetic claim-only pooling control | Run | Consistency loss trained on claim positions, evaluated on rationale pool | 43% rationale-pooled classifier accuracy, 48% swap-following | Pooling span matters; wrong span weakens coupling |
| Synthetic overlapping-vocab hard run | Run | 50% overlapping rationale vocabulary | Rationale-only, full-sequence, earlier-token-only hit 100% classifier/swap coupling; baseline 4.69% classifier | Coupling survives distributional overlap and is not just unique-token lookup |
| Generated rationale + scalar claim | Run | Model generates rationale plus scalar-bin claim | Rationale-only/full-consistency hit 100% claim-bin accuracy; baselines near-random | Best controlled instance of the paper’s “generated prose plus verifiable claim” mechanism |
| FEVER pretrained GPT-2 | Run | FEVER evidence/claim/label sequence with pooling variants | Full/claim pooling around 83-84% classifier; evidence-only around 44% | Real text introduces claim-token shortcuts; evidence-only coupling is weaker |
| FEVER tightened rerun | Run | Strict evidence-only masking, matched counterfactuals, random labels | Evidence-only strict: 43.8% classifier, 46.4% matched swap; random labels 23.1% | Evidence-only path is real but not high-accuracy; random-label control behaves correctly |
| FEVER from-scratch | Run | From-scratch FEVER model | Evidence-only remained weak around 44% classifier and 39% swap | Weak FEVER evidence coupling was not only a pretraining shortcut |
| KataGo win-probability | Run | Position tokens, commentary/rationale, win-probability bin and scalar | Full consistency: 81.1% claim-bin accuracy; no-consistency 25.1% | Strong real-domain oracle result; commentary hidden states carry KataGo win-rate information |
| Go GPT-OSS 1k/200 multi-claim | Run | Finetune on Go claim heads over 1k train/200 eval | Eval accuracy rose to 56.17%; macro-F1 peaked at 32.06% around epoch 4 | Model learns some claim heads but suffers class collapse; needs more/balanced data |
| Go GPT-OSS per-claim epoch-4 | Run | Per-claim eval summary | Global contestedness works best: 86.5% acc, 76.5% macro-F1; many other heads collapse | Some claims are learnable; rare/fine labels need rebalancing or redesign |
| Code V1 | Run | Algorithm explanations plus 3 structured claims | Full consistency: 100% coupling and 100% swap; baseline: 17.5% coupling and 5% swap | Mechanism works technically on code hidden states |
| Code rich ontology | Run | 12 richer code claim types, several ablations | Main coupling 98.6%, swap 97.8%, but manual usable prose 0/20 | Richer claims do not rescue prose with scratch model |
| Code strict-flow/surface bottlenecks | Run | no-claim-to-claim, claims-from-explanation-only, surface bottleneck variants | Strict hidden-state variants stay near 100% coupling; surface bottleneck weaker | Hidden-state coupling can persist through masks; surface-form coupling is much harder |
| LeanCheck | Run | Lean theorem, proof, rationale, VERIFIES/FAILS label | Rationale-only and proof-only each reach 100% span-specific counterfactual behavior | Cleanest formal-verifier separation result |
| LeanCheck activation patching | Run/preliminary | Minimal-pair activation patching | Rationale-only head patch: +4.518 rationale-minus-random; controls near zero | Preliminary causal support; head patching is more diagnostic than LM logits |
| Q-Former KataGo adapter | Design | Frozen KataGo trunk `[B, 361, 384]` cross-attended by 32 learned queries | No project run; 361 board intersections compressed to 32 tokens | Candidate architecture for richer Go grounding |
| Mastermind-Go-style KataGo LLM | Design/reference | Four-task curriculum, LoRA SFT/DPO, frozen KataGo trunk + adapter | Reference numerics: Task 1 99.44% single-task, 96.08% multi-task; Task 2 score MAE 1.74 | Prior architecture informs Go data/curriculum choices |
| LoGos reward limitation | Diagnostic/design | Outcome-level GRPO gives identical reward to explanation tokens | Reference finding: LoGos explanations correct only 55.6% even when moves are correct | Motivates process/counterfactual rewards and prefix bottlenecks |
| Prefix-locked explanation bottleneck | Design | `board description + explanation tokens + [MOVE]` | No run; proposed deletion/bottleneck tests | Tries to make explanation lie on causal path to move |
| Codex two-pass Go RL pilot | Design | 26-field concept schema, deterministic tools, KataGo outcome reward, counterfactual GRPO | No run; reward and success criteria specified | Best next-step RLVR experiment after SFT ceiling |
| Go deep-variation/RL/KataGo tool-use | Design | Top-k PVs, ownership, score, RLVR/GRPO reward via KataGo | No final metrics in accessible artifacts | Strong future direction, not a completed experimental result |

## Linked design and architecture threads

### Q-Former adapter for KataGo trunk features

This thread did not report a completed experiment. It specified an architecture for converting KataGo board features into LLM-consumable tokens. The frozen KataGo trunk produces a spatial tensor with shape `[B, 361, 384]`, representing 361 board intersections with 384-dimensional features. A Q-Former with 32 learnable query vectors cross-attends over the 361 trunk positions, producing 32 fixed output tokens that are projected into the LLM token space.

Key design findings:

- Q-Former is preferred over a linear projector because it can actively select, pool, and reorganize board information.
- The 32 query tokens can specialize into different board concepts, such as local tactics, territory, influence, or candidate-move regions.
- The design trades information capacity against LLM context length: too few queries may lose board detail, while too many queries consume context and make the decoder’s job harder.
- No numeric ablation was run in this thread. The only numerics are architectural: 19×19 board = 361 intersections, trunk feature shape `[B, 361, 384]`, and proposed compressed token count = 32.
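The compression step described above can be illustrated with a minimal single-head cross-attention sketch. This is not the thread's implementation: it omits learned key/value projections, multiple heads, query self-attention, and the projection into the LLM token space. The function name `cross_attend`, the random inputs, and the reduced feature width `D_DEMO` (standing in for 384 to keep the pure-Python demo fast) are all assumptions made for illustration.

```python
import math
import random


def cross_attend(queries, features):
    """Single-head cross-attention: each query pools over all feature rows.

    queries:  list of Q vectors, each of length d
    features: list of N vectors, each of length d (used as keys and values)
    returns:  list of Q output vectors, each of length d
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every board position.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in features]
        # Numerically stable softmax over the N positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # The weighted sum of feature vectors gives one output token.
        out = [sum(w * f[j] for w, f in zip(weights, features)) for j in range(d)]
        outputs.append(out)
    return outputs


rng = random.Random(0)
N_POSITIONS, N_QUERIES, D_DEMO = 361, 32, 8  # D_DEMO is a small stand-in for 384
board_features = [[rng.gauss(0, 1) for _ in range(D_DEMO)] for _ in range(N_POSITIONS)]
learned_queries = [[rng.gauss(0, 1) for _ in range(D_DEMO)] for _ in range(N_QUERIES)]

tokens = cross_attend(learned_queries, board_features)
print(len(tokens), len(tokens[0]))  # number of output tokens, width of each token
```

Whatever the board content, the output always has one token per query, which is the sense in which 361 positions are compressed to 32 tokens.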

### GPT-5.5 + KataGo verifiability and Mastermind-Go-style build guide

This thread combined a verifiability assessment with a full build guide for a KataGo-grounded transformer. It argued that GPT-5.5 plus KataGo can make numeric claims traceable to KataGo outputs, but that natural-language conceptual explanations remain LLM-generated and are not themselves verified unless they are tied to explicit claims.

Reference numerics and data sizes cited in the thread:

| Item | Reported value | Role in project |
|---|---:|---|
| GPT-5.5 Pro Terminal-Bench 2.0 | 82.7% | Background capability reference |
| GPT-5.5 Pro OSWorld-Verified | 78.7% | Background capability reference |
| General LLM next-move accuracy vs KataGo top-10 | below 35% | Motivation for Go-specific grounding |
| Mastermind-Go Task 1 state transition | 99.44% single-task, 96.08% multi-task | Prior result motivating board-state curriculum |
| Mastermind-Go Task 2 KataGo analysis | score MAE 1.74 points | Prior result motivating oracle-supervised claims |
| Mastermind-Go Task 3 natural-language explanation | 1,503 book samples | Data-starvation warning |

The proposed four-task curriculum was:

- Task 1: 150,000+ KataGo self-play examples for `(board_state, move) → next_board_state`.
- Task 2: 138,693 samples from 36 KataGo trajectories for board-to-KataGo analysis.
- Task 3: 1,503 Go book samples for natural-language explanations.
- Task 4: integrated chain-of-thought combining board state, KataGo analysis, and explanation.

Architecture variants proposed but not run in this project thread were: prompt-only GPT-5.5 + KataGo MCP; SFT on Tasks 1-2; full 4-task SFT; full SFT plus DPO with dan feedback; and a frozen KataGo trunk plus adapter plus LLM decoder. The thread identified the frozen KataGo trunk plus adapter plus LLM decoder as the strongest architecture because it preserves spatial board features while still allowing language generation.

### Causal mask design for board, explanation, and claims

The Go causal-mask thread formalized a sequence layout with board tokens, explanation tokens, and claim tokens. The intended attention mask was:

```text
q in Board:       attend to Board tokens up to q
q in Explanation: attend to Board plus Explanation tokens up to q
q in Claim:       attend only to Explanation tokens
```

The corresponding training loss was `L = L_explanation + λ_c × L_claims`. The purpose was to make claim predictions depend on explanation hidden states rather than a direct board or claim-token shortcut. The thread also recommended modular inline claims attached to local explanation units, rather than one monolithic claim block with many heads.
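A minimal sketch of the mask above as a boolean matrix, assuming the sequence is laid out as Board, then Explanation, then Claim tokens. The function name `build_bec_mask` and the list-of-lists representation are illustrative assumptions, not the thread's code. The claim rule is implemented literally, so claim queries see neither board tokens nor other claim tokens.

```python
def build_bec_mask(n_board, n_expl, n_claim):
    """Return mask[q][k] = True when query position q may attend to key position k.

    Assumed layout: Board tokens, then Explanation tokens, then Claim tokens.
    """
    n = n_board + n_expl + n_claim
    expl_start, claim_start = n_board, n_board + n_expl
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < expl_start:
                # Board query: causal attention within the board span.
                mask[q][k] = k <= q
            elif q < claim_start:
                # Explanation query: all board tokens plus explanation tokens up to q.
                mask[q][k] = k <= q
            else:
                # Claim query: explanation tokens only (no board, no claim tokens).
                mask[q][k] = expl_start <= k < claim_start
    return mask


mask = build_bec_mask(n_board=2, n_expl=3, n_claim=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the two claim rows mark only the three explanation columns, which is the property intended to remove the direct board-to-claim shortcut. The sketch assumes at least one explanation token, since a claim query would otherwise have nothing to attend to.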

### RL reasoning LM and LoGos sparse-reward limitation

Two linked threads analyzed why outcome-level Go RL does not guarantee faithful explanations. The central diagnosis was that LoGos-style GRPO rewards the final move or win-rate outcome, so every explanation token receives the same group-relative advantage signal. This can make explanations decorative: the model may learn to produce a good move from board context while writing plausible but unfaithful commentary.

The key reference result cited for LoGos was that human evaluators found only 55.6% of explanations correct even when the move prediction was right. This number was used as motivation rather than as a project-run result.

The proposed fix was a prefix-locked explanation bottleneck:

```text
board description + explanation tokens + [MOVE]
```

In this design, the model cannot emit the move until after generating the explanation. The move logits are computed from the sequence containing the explanation, so the reward for move quality can backpropagate through explanation tokens. The thread also emphasized that this is not sufficient by itself, because explanation token hidden states can still hide board information that is not human-readable. Therefore, the proposed diagnostic test is to delete or corrupt explanation tokens and check whether move quality collapses; if it does not, the explanation is still decorative.
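The deletion test can be summarized as a comparison of move quality with and without the explanation. The sketch below assumes per-position move-quality scores where higher is better; the function name `explanation_ablation_report` and the `min_drop` threshold are hypothetical, and the thread did not specify a concrete threshold.

```python
def explanation_ablation_report(quality_intact, quality_ablated, min_drop=0.05):
    """Compare move quality with intact versus deleted/corrupted explanations.

    quality_intact:  per-position move-quality scores with the explanation present
    quality_ablated: scores for the same positions with the explanation removed
    min_drop:        hypothetical threshold below which the drop counts as negligible
    """
    if not quality_intact or len(quality_intact) != len(quality_ablated):
        raise ValueError("need two non-empty score lists of equal length")
    mean_intact = sum(quality_intact) / len(quality_intact)
    mean_ablated = sum(quality_ablated) / len(quality_ablated)
    drop = mean_intact - mean_ablated
    return {
        "mean_intact": mean_intact,
        "mean_ablated": mean_ablated,
        "drop": drop,
        # A negligible drop suggests the move did not depend on the explanation.
        "decorative": drop < min_drop,
    }


# Illustrative numbers only, not project results.
report = explanation_ablation_report([0.9, 0.8, 0.7], [0.9, 0.8, 0.6])
print(report["decorative"])
```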

### Codex pilot: tool-augmented two-pass Go reasoning with counterfactual GRPO

The Codex pilot spec was a design-only thread for a stronger future experiment. It proposed a two-pass system:

- Pass 1: generate a 26-field concept JSON schema.
- Pass 2: condition on the concept schema to predict top principal variations, territory/ownership, score lead, and win rate.

The concept schema was divided into three tiers:

| Tier | Contents | Verifier source |
|---|---|---|
| Tier 1 | Liberties, atari, group size, ladders, ko, capture race, nets | Deterministic Go tools |
| Tier 2 | Move pattern name, shape type, joseki deviation | Boardmatcher-style pattern tools |
| Tier 3 | Sharpness, score uncertainty, utility gaps, ownership deltas, win rate, score lead, group status | KataGo oracle outputs |

The deterministic tool interface included `count_liberties`, `is_in_atari`, `get_group_status`, `check_ladder`, `is_ko_active`, `count_territory`, and `check_capture_race`. The intended training setup was QLoRA on Qwen3-8B non-thinking mode, with Qwen2.5-7B-Instruct as fallback, running on 2× A100 80GB with one GPU for generation and one for GRPO training.
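As an illustration of what a deterministic Tier 1 tool might look like, here is a minimal flood-fill sketch of `count_liberties` and `is_in_atari`. Only the two tool names come from the spec; the signatures, the helper `group_and_liberties`, and the board encoding (a square list of strings using `B`, `W`, and `.`) are assumptions.

```python
def group_and_liberties(board, row, col):
    """Flood-fill the group containing (row, col) and collect its liberties.

    board is a square list of equal-length strings using 'B', 'W', and '.'.
    Returns (stones, liberties) as two sets of (row, col) pairs.
    """
    color = board[row][col]
    if color == ".":
        raise ValueError("no stone at the given point")
    size = len(board)
    stones, liberties = set(), set()
    stack = [(row, col)]
    while stack:
        r, c = stack.pop()
        if (r, c) in stones:
            continue
        stones.add((r, c))
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < size and 0 <= nc < size):
                continue  # off the board
            if board[nr][nc] == ".":
                liberties.add((nr, nc))  # empty neighbor is a liberty
            elif board[nr][nc] == color:
                stack.append((nr, nc))  # same-color neighbor joins the group
    return stones, liberties


def count_liberties(board, row, col):
    """Number of distinct empty points adjacent to the group at (row, col)."""
    return len(group_and_liberties(board, row, col)[1])


def is_in_atari(board, row, col):
    """A group is in atari when it has exactly one liberty."""
    return count_liberties(board, row, col) == 1


demo_board = [
    ".B...",
    "BWB..",
    ".....",
    ".....",
    ".....",
]
print(count_liberties(demo_board, 1, 1), is_in_atari(demo_board, 1, 1))
```

Because the result depends only on the board, a tool like this can serve as an exact verifier for the corresponding concept field.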

The reward design was:

```text
R_out = 0.40 × R_pv + 0.25 × R_own + 0.20 × R_score + 0.15 × R_wr
R_total = R_fmt × (0.4 × R_out + 0.35 × R_proc + 0.25 × R_cf)
```
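A direct transcription of the two formulas above, assuming every component is already normalized to `[0, 1]`. The function and argument names are illustrative, and the input values in the example are made up rather than taken from any run.

```python
def outcome_reward(r_pv, r_own, r_score, r_wr):
    """R_out: weighted mix of PV, ownership, score, and win-rate rewards."""
    return 0.40 * r_pv + 0.25 * r_own + 0.20 * r_score + 0.15 * r_wr


def total_reward(r_fmt, r_out, r_proc, r_cf):
    """R_total: format reward multiplies the outcome/process/counterfactual mix."""
    return r_fmt * (0.4 * r_out + 0.35 * r_proc + 0.25 * r_cf)


# Illustrative inputs only.
r_out = outcome_reward(r_pv=1.0, r_own=0.5, r_score=0.5, r_wr=0.0)
r_total = total_reward(r_fmt=1.0, r_out=r_out, r_proc=0.8, r_cf=0.2)
print(round(r_out, 3), round(r_total, 3))
```

Both weight sets sum to 1, so with normalized inputs each reward stays in `[0, 1]`. Because `R_fmt` is a multiplier, a format score of zero removes the entire reward regardless of outcome quality.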

The data mix was proposed as 40% professional games, 40% synthetic critical positions, and 20% LoGos training positions. The pilot success criteria were: move quality at least matching a LoGos-style baseline, concept accuracy beating a no-tool baseline by at least 5 percentage points, and a positive counterfactual delta showing that concept fields causally affect Pass 2 outputs.

## Synthetic claim-consistency experiments

### Initial smoke test

The initial synthetic task used sequences of the form `[BOS] prompt [SEP] rationale [SEP] claim`, with 8 latent states and deterministic state-specific claim tokens. The main variants were `no_consistency_loss`, `rationale_only`, `full_sequence`, and `earlier_token_only`.
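The sequence layout and the pooling spans implied by the variant names can be sketched as follows. The helper names and token strings are hypothetical. Whether `earlier_token_only` also pools the `[BOS]` and `[SEP]` positions is not stated in the source; this sketch assumes it pools only the prompt and rationale tokens.

```python
def build_sequence(prompt, rationale, claim):
    """Assemble tokens and record the half-open index range of each segment."""
    tokens = ["[BOS]"]
    spans = {}
    for name, segment in (("prompt", prompt), ("rationale", rationale), ("claim", claim)):
        start = len(tokens)
        tokens.extend(segment)
        spans[name] = (start, len(tokens))
        if name != "claim":
            tokens.append("[SEP]")  # separator after prompt and after rationale
    return tokens, spans


def pooling_positions(variant, spans, n_tokens):
    """Positions whose hidden states would feed the consistency classifier."""
    if variant == "no_consistency_loss":
        return []  # no auxiliary classifier is trained
    if variant == "rationale_only":
        return list(range(*spans["rationale"]))
    if variant == "earlier_token_only":
        # Assumption: prompt plus rationale tokens, excluding special tokens.
        return list(range(*spans["prompt"])) + list(range(*spans["rationale"]))
    if variant == "full_sequence":
        return list(range(n_tokens))
    raise ValueError(f"unknown variant: {variant}")


demo_tokens, demo_spans = build_sequence(["p1", "p2"], ["r1", "r2", "r3"], ["c1"])
for variant in ("no_consistency_loss", "rationale_only", "full_sequence", "earlier_token_only"):
    print(variant, pooling_positions(variant, demo_spans, len(demo_tokens)))
```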

Reported smoke-test findings:

| Variant | Rationale-pooled classifier accuracy | Generation accuracy | Counterfactual classifier follows swapped rationale | Finding |
|---|---:|---:|---:|---|
| `no_consistency_loss` | 3.9% | 75.0% | weak/noisy | LM can partly generate claims without encoding rationale state |
| `rationale_only` | 100% | about 63% | 100% | Direct rationale pooling forces perfect rationale-state encoding |
| `full_sequence` | 100% | 93.8% | 100% | Strongest early generation accuracy |
| `earlier_token_only` | 100% | about 62% | 100% | Prompt+rationale span also supports coupling |

Important finding: all models performed at or slightly below chance on shuffled rationale-claim pairings, around 8-10% against a 12.5% chance level for 8 states. This supported the view that consistency-trained classifiers were reading rationale content rather than memorizing global label frequencies.

### Scaled synthetic convergence

The scaled run increased data and training time. The key result was not simply higher accuracy; it revealed a failure mode of the baseline.

Reported findings:

- All four variants reached 100% generation accuracy.
- The no-consistency baseline’s rationale-pooled classifier accuracy stayed near chance at 8.8%.
- The no-consistency baseline’s counterfactual classifier swap-following dropped to 7.8%.
- The no-consistency baseline showed 0% generation following swapped rationales.
- Consistency variants retained 100% classifier accuracy and 100% counterfactual swap-following.

Interpretation: the LM objective can converge to perfect claim generation while bypassing rationales entirely. This is one of the strongest motivations for a separate consistency objective: generation accuracy alone is not evidence of faithful rationale use.
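The two counterfactual rates used throughout this inventory can be computed as below. This is a minimal sketch, assuming each counterfactual example has a prediction, the label of the original latent state, and the label implied by the swapped-in rationale, with the two labels differing. The function name and the toy inputs are illustrative.

```python
def swap_following_rates(predictions, original_labels, swapped_labels):
    """Return (follows_swap, follows_original) over counterfactual examples."""
    n = len(predictions)
    if n == 0 or n != len(original_labels) or n != len(swapped_labels):
        raise ValueError("need three non-empty lists of equal length")
    # Fraction of predictions matching the label of the swapped-in rationale.
    follows_swap = sum(p == s for p, s in zip(predictions, swapped_labels)) / n
    # Fraction of predictions still matching the original example's label.
    follows_original = sum(p == o for p, o in zip(predictions, original_labels)) / n
    return follows_swap, follows_original


# Toy inputs, not project data.
rates = swap_following_rates(
    predictions=[3, 3, 1, 5],
    original_labels=[0, 2, 1, 4],
    swapped_labels=[3, 3, 6, 5],
)
print(rates)
```

The two rates need not sum to 1, because a prediction can match neither label. A model that bypasses the rationale should show a low swap-following rate even when its generation accuracy is perfect.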

### Hidden-state intervention / causal patching

The hidden-state intervention patched rationale hidden states from one example into another, measuring whether claim prediction followed patched hidden states or surface tokens.

Reported findings:

| Variant group | Layer-0 intervention follows patched original hidden state | Interpretation |
|---|---:|---|
| Consistency-trained variants | 73-89% | Patching rationale states can causally redirect prediction |
| `no_consistency_loss` baseline | 31% | Baseline weakly responds to patched rationale states |
| Chance for 8 states | 12.5% | Reference level; the 31% baseline is only modestly above it |

Layer-1 patching showed 0% following original hidden states and 100% following swapped tokens, interpreted as an artifact of the final computation path and patch point rather than a contradiction of the layer-0 effect.

### Claim-only pooling negative control

The `claim_only_pooling` control trained the consistency loss on claim token hidden states but evaluated on rationale-pooled hidden states. This created a shortcut where the model could satisfy the auxiliary objective without making rationale states predictive.

Reported findings:

| Metric | `claim_only_pooling` | Rationale-trained variants | Baseline |
|---|---:|---:|---:|
| Rationale-pooled classifier accuracy | 43% | 100% | 3.9% |
| Counterfactual classifier follows swapped rationale | 48% | 100% | weak/noisy |
| Counterfactual classifier follows original | 8% | 0% | weak/noisy |
| Generation accuracy | 47% | 62-94% in smoke | 75% |

Interpretation: the wrong pooling span partially leaks information but does not produce reliable rationale-claim coupling. This is a clean control showing the target span matters.

### Hard overlapping-vocabulary synthetic run

The hard run made rationale token vocabularies overlap by about 50% across latent states. Templates used shared tokens appearing in all states, group tokens shared by adjacent states, and local tokens unique to a state. The model therefore could not rely on one unique token per state.
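One way to build such templates is to sample each rationale token from either an overlapping pool (shared plus group tokens) or a state-local pool. The sketch below is an assumed construction, not the run's actual generator: the token names, the pairing of adjacent states via `state // 2`, and the per-token sampling rule are all hypothetical.

```python
import random


def make_rationale(state, length=8, overlap_fraction=0.5, rng=None):
    """Sample rationale tokens for one latent state from three token pools."""
    if rng is None:
        rng = random.Random()
    shared = ["shared_a", "shared_b"]                         # appear in all states
    group = [f"group{state // 2}_a", f"group{state // 2}_b"]  # shared by a pair of adjacent states
    local = [f"state{state}_a", f"state{state}_b"]            # unique to this state
    tokens = []
    for _ in range(length):
        if rng.random() < overlap_fraction:
            # Overlapping token: does not identify the state on its own.
            tokens.append(rng.choice(shared + group))
        else:
            # Local token: unique to this state.
            tokens.append(rng.choice(local))
    return tokens


demo_rng = random.Random(0)
print(make_rationale(3, rng=demo_rng))
```

With `overlap_fraction=0.5`, about half of the sampled tokens come from pools that other states also use, so a classifier cannot rely on every token being state-unique.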

Configuration:

| Parameter | Value |
|---|---:|
| train/eval/shuffled | 512 / 128 / 128 |
| latent states | 8 |
| templates per state | 4 |
| epochs | 10 |
| model | 2-layer, d_model 64, 4 heads |
| consistency loss weight | 0.5 |
| overlap fraction | 0.5 |

Results:

| Variant | Gen claim acc | Rationale-pool cls acc | CFact gen swap | CFact cls swap | Shuffled cls acc |
|---|---:|---:|---:|---:|---:|
| `no_consistency_loss` | 1.0000 | 0.0469 | 1.0000 | 0.0625 | 0.1719 |
| `rationale_only` | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.1016 |
| `full_sequence` | 0.8125 | 1.0000 | 0.7656 | 1.0000 | 0.1016 |
| `earlier_token_only` | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.1016 |

Interpretation: the consistency-trained variants still achieved perfect hidden-state coupling despite overlap, while the baseline reached perfect generation but only 4.69% rationale-pooled classifier accuracy, below the 12.5% chance level for 8 states. This supports the claim that the mechanism learns co-occurrence/state structure rather than trivial unique-token lookup.

## Generated rationale plus scalar claim

This experiment was designed to match the intended general recipe more closely than FEVER. The model generated both rationale tokens and a scalar-like claim token. An oracle-like mapping assigned each latent state a scalar in `[0,1]`, discretized into 10 bins.
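The discretization step can be sketched as follows, assuming equal-width bins with the right endpoint folded into the last bin. The source states only that the scalar was discretized into 10 bins, so the bin boundaries and the function name `scalar_to_bin` are assumptions.

```python
def scalar_to_bin(value, n_bins=10):
    """Map a scalar in [0, 1] to an equal-width bin index in 0..n_bins-1."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    # min() folds the right endpoint 1.0 into the last bin.
    return min(int(value * n_bins), n_bins - 1)


print([scalar_to_bin(v) for v in (0.0, 0.05, 0.5, 0.95, 1.0)])
```

With 10 bins, uniform guessing would give 10% claim-bin accuracy, which is the reference level for reading the results table.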

Variants:

- `lm_only`: language modeling only.
- `no_consistency_loss`: LM plus scalar regression from claim position.
- `rationale_only`: LM plus bin prediction from rationale-pooled states.
- `full_consistency`: LM plus scalar regression plus rationale consistency.
- `random_consistency`: full setup but consistency labels randomized.

Results:

| Variant | Token acc | Claim-bin acc | Scalar MSE | CFact cls follows swap | CFact cls follows orig |
|---|---:|---:|---:|---:|---:|
| `lm_only` | 0.8231 | 0.0000 | 0.12835 | 0.0313 | 0.1172 |
| `no_consistency_loss` | 0.8197 | 0.0410 | 0.00014 | 0.1602 | 0.0352 |
| `rationale_only` | 0.8199 | 1.0000 | 0.09820 | 0.0000 | 1.0000 |
| `full_consistency` | 0.8166 | 1.0000 | 0.00018 | 0.0000 | 1.0000 |
| `random_consistency` | 0.8190 | 0.1641 | 0.00029 | 0.0664 | 0.1406 |

Findings:

- Rationale-only and full-consistency reached 100% bin accuracy from rationale hidden states.
- Token-level accuracy remained around 82% for all variants, so consistency improved coupling rather than ordinary LM accuracy.
- Full consistency preserved low scalar MSE while also producing perfect bin coupling.
- Random labels and no-consistency did not produce reliable rationale-bin coupling.

Important caveat: the counterfactual direction differs from the KataGo setup. In this synthetic generated-rationale run, consistency-trained variants stayed faithful to the original latent state under the reported counterfactual construction rather than following swapped surface rationales. The useful paper-level point is that the rationale representation fully encoded the verifier-relevant scalar bin while baselines did not.

## FEVER experiments

### Pretrained GPT-2 FEVER run

The first FEVER run used a 50k training / 5k evaluation setup with FEVER-style evidence, claim, and label tokens. The goal was to see whether the synthetic metric vocabulary transferred to a real fact-verification task.

Results from `fever50k_full_20260428.csv`:

| Variant | Gen claim acc | Cls claim acc | CFact cls swap | CFact cls orig | CFact gen swap | CFact gen orig | Shuffled cls acc |
|---|---:|---:|---:|---:|---:|---:|---:|
| `no_consistency_loss` | 0.8040 | 0.3980 | 0.3660 | 0.3000 | 0.2820 | 0.4540 | 0.3040 |
| `evidence_only_pooling` | 0.8160 | 0.4410 | 0.4800 | 0.2520 | 0.3000 | 0.4520 | 0.3100 |
| `full_sequence_pooling` | 0.8300 | 0.8404 | 0.3020 | 0.4420 | 0.2820 | 0.4460 | 0.4660 |
| `claim_only_pooling` | 0.8260 | 0.8334 | 0.2860 | 0.4660 | 0.2940 | 0.4700 | 0.4740 |

Findings:

- Full-sequence and claim-only pooling performed best on classifier accuracy, around 83-84%.
- Evidence-only pooling reached only 44.1% classifier accuracy, modestly above the 39.8% baseline.
- Counterfactual metrics were not cleanly variant-separated.
- Claim-only pooling behaved like a strong model, not a negative control, because it could read the answer-bearing label/claim region.

Interpretation: FEVER showed label conditioning and some transfer to real text, but not the clean evidence-only coupling story. The task permits shortcuts through claim text and label priors.

### Tightened FEVER rerun

The tightened rerun added `evidence_only_strict`, matched-claim counterfactuals, and an evidence-only random-label control.

Results from `fever50k_rerun_20260428.csv`:

| Variant | Gen acc | Cls acc | CFact cls swap | CFact cls orig | Matched swap | Matched orig | Shuffled cls |
|---|---:|---:|---:|---:|---:|---:|---:|
| `no_consistency_loss` | 0.8040 | 0.5278 | 0.3500 | 0.3760 | 0.3160 | 0.3660 | 0.4420 |
| `evidence_only_pooling` | 0.8040 | 0.4412 | 0.4620 | 0.2660 | 0.4580 | 0.2340 | 0.3120 |
| `evidence_only_strict` | 0.8140 | 0.4384 | 0.4660 | 0.2620 | 0.4640 | 0.2300 | 0.3060 |
| `full_sequence_pooling` | 0.8220 | 0.8358 | 0.2900 | 0.4400 | 0.3440 | 0.4460 | 0.4560 |
| `claim_only_pooling` | 0.8380 | 0.8372 | 0.2960 | 0.4520 | 0.3380 | 0.4600 | 0.4720 |
| `evidence_only_random_labels` | 0.8060 | 0.2312 | 0.1940 | 0.4420 | 0.2180 | 0.4200 | 0.3640 |

Findings:

- Strict evidence-only behaved almost identically to evidence-only pooling.
- Evidence-only variants were more evidence-sensitive on matched-claim counterfactuals than full/claim pooling.
- Full/claim pooling still dominated raw classifier accuracy.
- Random-label control collapsed to 23.1% classifier accuracy while LM generation stayed around 80.6%, confirming the diagnostic was not purely spurious.

Interpretation: FEVER is best treated as a diagnostic negative result or as appendix material. It shows that evidence-only consistency can produce evidence sensitivity, but raw accuracy remains much lower than with shortcut-permitting pooling.

### From-scratch FEVER run

A from-scratch FEVER model was tested to see whether pretrained shortcuts caused the weak evidence-only coupling.

Reported findings:

| Condition | Classifier accuracy | Swap-following |
|---|---:|---:|
| From-scratch evidence-only | about 44% | about 39% |
| Pretrained evidence-only | about 44% | about 46% |
| From-scratch full-sequence | about 99% | about 97% |

Interpretation: the evidence-only weakness was not simply caused by GPT-2 pretraining. FEVER’s evidence-claim relationship is too semantically complex for the simple pooled-head mechanism to cleanly recover evidence-local coupling, while full-sequence pooling remains easy because claim/label information is accessible.

## KataGo and Go experiments

### KataGo win-probability coupling

The KataGo experiment used real KataGo-based JSONL splits with 3,373 train and 375 eval examples. The sequence format included position tokens, commentary/rationale tokens, and a win-probability claim bin token. The model had an LM head, a scalar head predicting continuous win probability, and a bin head predicting win-probability bin from pooled rationale hidden states.

Results:

| Variant | Token acc | Claim-bin acc | Scalar MSE | MAE winprob | Pearson r | Spearman r | CFact swap | CFact orig |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `lm_only` | 0.48 | 0.02 | 0.1606 | 0.3785 | 0.36 | 0.49 | 0.10 | 0.11 |
| `no_consistency_loss` | 0.48 | 0.25 | 0.0016 | 0.0268 | 0.9953 | 0.9320 | 0.20 | 0.21 |
| `rationale_only` | 0.45 | 0.79 | 0.1807 | 0.3962 | 0.15 | 0.19 | 0.52 | 0.14 |
| `full_consistency` | 0.45 | 0.81 | 0.0020 | 0.0280 | 0.9943 | 0.9176 | 0.55 | 0.14 |
| `random_consistency` | 0.47 | 0.16 | 0.0017 | 0.0284 | 0.9951 | 0.9354 | 0.17 | 0.10 |

Claim-bin accuracy was later restated with exact values:

| Variant | Claim-bin accuracy |
|---|---:|
| `lm_only` | 0.016 |
| `no_consistency_loss` | 0.251 |
| `rationale_only` | 0.787 |
| `full_consistency` | 0.811 |
| `random_consistency` | 0.160 |

Findings:

- Scalar win-rate prediction is easy when directly supervised: `no_consistency_loss`, `full_consistency`, and `random_consistency` all achieve Pearson around 0.995 and MAE around 0.027-0.028.
- Rationale-pooled claim-bin accuracy is the key differentiator: consistency-trained rationale spans jump to about 79-81%, while baselines are around 1.6-25.1%.
- Counterfactual swaps show the bin head follows swapped commentary substantially more often for consistency-trained variants, about 52-55% swap versus 14% original.

Interpretation: this is the main dense-domain verifier result. Consistency loss changes what commentary hidden states encode, even though scalar prediction from the claim position can be solved without it.

### Go GPT-OSS 1k train / 200 eval multi-claim experiment

This was a practical fine-tuning run on a Go claim dataset with about 1k train and 200 eval positions.

Source reconciliation: the linked causal-mask thread also describes a broader 9,000 train / 1,000 eval Go-position setup with human commentary and KataGo-verified claims, while the workspace/session result table available for exact epoch metrics is the smaller 1k train / 200 eval run. The per-head pattern is consistent across the descriptions: raw accuracy can look acceptable, but macro-F1 exposes class collapse.

The reported aggregate metrics across epochs were:

| Epoch | Train Acc | Eval Acc | Train Macro-F1 | Eval Macro-F1 |
|---:|---:|---:|---:|---:|
| 1 | 0.4858 | 0.4889 | 0.1953 | 0.1955 |
| 2 | 0.5272 | 0.5228 | 0.2440 | 0.2375 |
| 3 | 0.5561 | 0.5539 | 0.2747 | 0.2674 |
| 4 | 0.5873 | 0.5544 | 0.3964 | 0.3206 |
| 5 | 0.6288 | 0.5617 | 0.4573 | 0.3123 |

Findings:

- Eval accuracy improved through epoch 5, but by a small amount after epoch 3.
- Eval macro-F1 peaked at epoch 4 and declined at epoch 5.
- The train/eval macro-F1 gap widened by epoch 5, indicating mild overfitting.
- Epoch 4 was treated as the best checkpoint by macro-F1.

### Go GPT-OSS epoch-4 per-claim evaluation

Epoch-4 per-claim metrics:

| Claim | Accuracy | Macro-F1 | Main failure |
|---|---:|---:|---|
| `win_prob_bin` | 0.315 | 0.177 | heavy collapse into bin 8 |
| `score_lead_bin` | 0.550 | 0.171 | mostly predicts `LEAD_CLOSE` |
| `phase_estimate` | 0.755 | 0.326 | predicts almost everything as `PHASE_OPENING` |
| `main_control_region` | 0.300 | 0.191 | overpredicts `CTRL_BOTTOM_B` |
| `main_contested_region` | 0.450 | 0.342 | confuses center/left/right/top |
| `global_contestedness` | 0.865 | 0.765 | working best |
| `best_move_region` | 0.550 | 0.276 | mostly predicts bottom/top, never right |
| `move_urgency` | 0.540 | 0.324 | underpredicts `URG_HIGH` and `URG_MED` |
| `search_surprise` | 0.665 | 0.313 | mostly predicts `SURPRISE_LOW` |

Findings:

- `global_contestedness` is the one clearly strong head.
- Many other heads show majority-class collapse despite reasonable raw accuracy.
- Macro-F1 is much lower than accuracy for most heads, indicating class imbalance and underuse of rare labels.
- Scaling to 10k+ positions with stratified sampling, class weights/focal loss, and early stopping was recommended.
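
To illustrate why accuracy can look acceptable while macro-F1 exposes collapse, the sketch below scores a predictor that always emits the majority class. The label names and counts are hypothetical and are not taken from the Go dataset. The inverse-frequency weighting shown is one common form of class weighting, assumed here rather than taken from the source.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in labels or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def inverse_frequency_weights(y_true):
    """Class weights n / (k * count), so rare classes get weights above 1."""
    counts = Counter(y_true)
    n, k = len(y_true), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Hypothetical imbalanced head: 15 majority labels, 5 rare labels.
y_true = ["majority"] * 15 + ["rare_1"] * 3 + ["rare_2"] * 2
y_pred = ["majority"] * 20  # fully collapsed predictor

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
weights = inverse_frequency_weights(y_true)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")
print("class weights:", weights)
```

In this toy case the collapsed predictor reaches 0.75 accuracy but only about 0.29 macro-F1, the same qualitative pattern reported for heads such as `phase_estimate`.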

## Code coupling experiments

### V1 standard ablation ladder

The first code run used algorithmic functions with structured claims about time complexity, space complexity, and correctness. The model was a scratch GPT-2-style causal Transformer.

Reported final findings:

| Variant | Coupling strength | Counterfactual swap influence | BLEU-1 | Claim accuracy |
|---|---:|---:|---:|---:|
| `consistency_loss` | 100% | 100% | 0.0673 | 100% |
| `no_consistency_loss` | 17.5% | 5% | 0.0600 | not emphasized |

Interpretation: the basic code setup confirmed that consistency loss can make explanation hidden states predictive of oracle claims, while LM-only training does not.
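
The source does not spell out the exact form of the consistency loss, so the following is only a sketch under stated assumptions: explanation hidden states are mean-pooled over a token span, a linear head maps the pooled vector to claim logits, and the consistency term is a softmax cross-entropy against the oracle claim, added to the LM loss with a scalar weight. All names (`mean_pool`, `consistency_loss`, `total_loss`, `weight`) are hypothetical.

```python
import math


def mean_pool(hidden_states, span):
    """Average hidden vectors over token positions [start, end)."""
    start, end = span
    rows = hidden_states[start:end]
    return [sum(column) / len(rows) for column in zip(*rows)]


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, via log-sum-exp for stability."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def consistency_loss(hidden_states, span, head_weights, target):
    """Assumed form: classify the oracle claim from the pooled explanation span."""
    pooled = mean_pool(hidden_states, span)
    logits = [sum(w * h for w, h in zip(row, pooled)) for row in head_weights]
    return cross_entropy(logits, target)


def total_loss(lm_loss, cons_loss, weight=1.0):
    """LM loss plus weighted consistency term; weight=0 mimics an LM-only baseline."""
    return lm_loss + weight * cons_loss


# Toy input: 4 tokens, 2 hidden dims; the explanation span is positions 1-2.
hidden = [[9.0, 9.0], [2.0, 0.0], [4.0, 0.0], [9.0, 9.0]]
head = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2-class claim head
loss_right = consistency_loss(hidden, (1, 3), head, target=0)
loss_wrong = consistency_loss(hidden, (1, 3), head, target=1)
print(loss_right, loss_wrong, total_loss(2.0, loss_right, weight=0.5))
```

Because only the pooled span feeds the head, tokens outside the span (positions 0 and 3 above) cannot lower the consistency term, which is the property the span-specific controls in this inventory are designed to test.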

### V2 strict-flow / bottleneck ablations

Final-epoch validation metrics for V2 variants:

| Variant | Coupling strength | BLEU-1 | ROUGE-L | Swap influence | Claim accuracy | Val LM loss |
|---|---:|---:|---:|---:|---:|---:|
| `no_claim_to_claim_attention` | 1.0000 | 0.0758 | 0.0998 | 1.0000 | 1.0000 | 0.0781 |
| `claims_from_explanation_only` | 1.0000 | 0.0708 | 0.0810 | 1.0000 | 0.8000 | 0.0765 |
| `surface_bottleneck_consistency` | 0.6967 | 0.0677 | 0.0889 | -0.1500 | 1.0000 | 0.0729 |
| `surface_bottleneck_no_expl_lm` | 0.8080 | 0.0033 | 0.0014 | -0.0500 | 0.1000 | 0.0570 |

Findings:

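- Hidden-state bottleneck variants (`no_claim_to_claim_attention`, `claims_from_explanation_only`) preserve perfect coupling (1.0000) and swap influence (1.0000) under stricter attention constraints.
- Surface-bottleneck variants are weaker: coupling drops to 0.6967 and 0.8080, and swap influence is slightly negative (-0.1500 and -0.0500). Removing the explanation LM loss also collapses BLEU-1 (0.0033), ROUGE-L (0.0014), and claim accuracy (0.1000), even though its coupling score is the higher of the two.
- Claim accuracy can fall when claims are forced through explanation-only pathways (0.8000 for `claims_from_explanation_only` versus 1.0000 for `no_claim_to_claim_attention`).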
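
One way the two attention constraints could be expressed is as a boolean mask over a causal attention matrix. This sketch is an assumption, not the experiment's code: it treats each token as having a single role (`code`, `expl`, or `claim`), keeps every position visible to itself so no softmax row is empty, and uses hypothetical flag names.

```python
def build_attention_mask(roles, block_claim_to_claim=False,
                         claims_from_explanation_only=False):
    """Return allowed[i][j], True when query position i may attend to key position j.

    roles: one of "code", "expl", "claim" per token (hypothetical role names).
    """
    n = len(roles)
    # Start from a standard causal mask: each position sees itself and the past.
    allowed = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        if roles[i] != "claim":
            continue  # only claim queries are restricted
        for j in range(i):  # strictly earlier keys; self-attention stays on
            if block_claim_to_claim and roles[j] == "claim":
                allowed[i][j] = False
            if claims_from_explanation_only and roles[j] != "expl":
                allowed[i][j] = False
    return allowed


roles = ["code", "code", "expl", "expl", "claim", "claim"]
no_c2c = build_attention_mask(roles, block_claim_to_claim=True)
expl_only = build_attention_mask(roles, claims_from_explanation_only=True)

for name, mask in [("no_claim_to_claim", no_c2c), ("explanation_only", expl_only)]:
    print(name)
    for row in mask:
        print("".join("1" if cell else "." for cell in row))
```

A real implementation with multi-token claims would need to decide whether tokens inside the same claim may see each other; this sketch sidesteps that by using one position per claim.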

### Rich 12-claim ontology experiment

The rich code experiment expanded from 3 claims to 12 verifiable properties, including complexity, algorithm class, loop structure, key operation, access pattern, auxiliary structures, mutation, correctness status, empty-input handling, and duplicate handling.

Summary table:

| Variant | Mean coupling | Swap influence | BLEU-1 | ROUGE-1 |
|---|---:|---:|---:|---:|
| `consistency_loss` | 0.986 | about 0.978 | 0.0611 | about 0.061 |
| `no_consistency_loss` | 0.286 | about -0.033 | 0.0686 | about 0.062 |
| `claim_only_pooling` | 0.595 | about 0.122 | 0.0554 | about 0.04 |
| `random_label_consistency` | 0.562 | about -0.022 | 0.0593 | about 0.05 |
| `no_claim_to_claim_attention` | 0.986 | about 0.977 | 0.0614 | about 0.05 |
| `claims_from_explanation_only` | 0.986 | about 0.977 | 0.0558 | about 0.05 |

Manual review:

| Variant | Behavior correct | Time correct | Space correct | Bug status correct | Fully correct prose | Prose contradicts claims |
|---|---:|---:|---:|---:|---:|---:|
| `claim_only_pooling` | 0/20 | 5/20 | 16/20 | 17/20 | 0/20 | 17/20 |
| `consistency_loss` | 0/20 | 4/20 | 10/20 | 17/20 | 0/20 | 20/20 |
| `no_consistency_loss` | 0/20 | 5/20 | 11/20 | 17/20 | 0/20 | 17/20 |
| `random_label_consistency` | 0/20 | 5/20 | 9/20 | 17/20 | 0/20 | 17/20 |

Findings:

- Richer claims did not improve natural-language explanation correctness.
- The main `consistency_loss` variant reached 98.6% mean coupling and about 97.8% swap influence, yet produced 0/20 fully correct prose explanations in manual review.
- Scratch-trained models often emitted memorized templates and `<sep>`-like degenerate text.
- Structured claims could be correct while prose contradicted them.

Interpretation: this is the core diagnostic negative. Representation-level coupling is not sufficient for explanation quality. A stronger code experiment needs pretrained language ability, adversarial mismatched explanations, strict text-only claim extraction, and semantic evaluation.

## LeanCheck formal-verifier experiment

LeanCheck tests whether natural-language rationale spans encode formal verifier outcomes. Sequences include `[THEOREM]`, `[PROOF]`, `[RAT]`, and `[CLAIM]` sections, with a binary `VERIFIES` or `FAILS` label.

Dataset:

- 1,000 train examples.
- 200 eval examples.
- 200 counterfactual swaps.
- 200 minimal-pair rows.
- Domains: natural-number equalities, propositional logic, simple list lemmas.
- Mutations: wrong lemma, wrong theorem/proof pairing, missing premise, deleted proof line, renamed variable, replacement tactic, adversarial near miss.

Main results:

| Variant | Cls claim acc | Gen claim acc | Cons loss | CFact follows swap | CFact follows orig | Minimal-pair flip |
|---|---:|---:|---:|---:|---:|---:|
| `lm_only` | 0.490 | 0.510 | 0.830 | 0.495 | 0.495 | 0.000 |
| `no_consistency_loss` | 0.510 | 0.525 | 1.053 | 0.505 | 0.505 | 0.000 |
| `rationale_only` | 1.000 | 0.640 | 0.000 | 1.000 | 0.010 | 1.000 |
| `full_sequence` | 1.000 | 0.990 | 0.001 | 0.995 | 0.005 | 1.000 |
| `proof_only` | 1.000 | 0.915 | 0.000 | 0.010 | 1.000 | 1.000 |
| `random_consistency` | 0.410 | 0.775 | 0.770 | 0.400 | 0.590 | 0.070 |
| `wrong_span` | 0.520 | 0.540 | 0.690 | 0.495 | 0.515 | 0.000 |

Key finding: `rationale_only` follows the swapped rationale on 100% of counterfactuals, while `proof_only` follows the original proof on 100%. This is the cleanest span-specific separation in the whole project.
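
The two counterfactual columns can be read as simple agreement rates. The sketch below assumes each counterfactual row carries a prediction, the label implied by the original pairing, and the label implied by the swapped rationale. The function name and the toy rows are hypothetical and are not drawn from the LeanCheck data.

```python
def counterfactual_follow_rates(predictions, original_labels, swapped_labels):
    """Return (follows_swap, follows_orig) as fractions of counterfactual rows."""
    n = len(predictions)
    follows_swap = sum(p == s for p, s in zip(predictions, swapped_labels)) / n
    follows_orig = sum(p == o for p, o in zip(predictions, original_labels)) / n
    return follows_swap, follows_orig


# Toy rows: the last swap happens to leave the label unchanged.
original = ["VERIFIES", "FAILS", "VERIFIES", "FAILS"]
swapped = ["FAILS", "VERIFIES", "FAILS", "FAILS"]

rationale_follower = list(swapped)   # a model that reads only the rationale
proof_follower = list(original)      # a model that reads only the proof

print(counterfactual_follow_rates(rationale_follower, original, swapped))
print(counterfactual_follow_rates(proof_follower, original, swapped))
```

In this sketch the two rates are computed independently, so they need not sum to one whenever a swap leaves the label unchanged.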

Activation patching summary:

| Variant | LM rationale-minus-random | Head rationale-minus-random | Interpretation |
|---|---:|---:|---|
| `lm_only` | 5.410 | -0.001 | LM logits move, head does not |
| `no_consistency_loss` | 5.097 | 0.064 | weak/no head-specific effect |
| `rationale_only` | 9.774 | 4.518 | strongest rationale-span head effect |
| `full_sequence` | 5.463 | 0.224 | head effect weaker than rationale-only |
| `proof_only` | -5.208 | -3.758 | effect not rationale-directed |
| `random_consistency` | -7.494 | 0.025 | control near zero on head |
| `wrong_span` | 10.741 | -0.014 | wrong-span head does not show rationale effect |

Interpretation: LeanCheck is a strong AI4Math-facing result because it connects informal rationales to a formal proof-checker outcome without attempting full proof synthesis. The caveat is that templated data and binary accept/reject labels make the task easier than general theorem proving.

## Ablation index

| Ablation/control | Where run | Purpose | Outcome |
|---|---|---|---|
| `no_consistency_loss` | Synthetic, FEVER, KataGo, code, LeanCheck | LM/scalar baseline without rationale consistency | Often generates/claims well but weakly couples rationale states |
| `rationale_only` / `evidence_only` | Synthetic, scalar, FEVER, KataGo, LeanCheck | Force claim/verifier label from rationale/evidence span | Strong in synthetic, scalar, KataGo, LeanCheck; weaker in FEVER |
| `full_sequence` / `full_consistency` | Synthetic, FEVER, KataGo, LeanCheck | Pool full sequence or combine scalar + consistency losses | Often high raw accuracy; can include shortcut access to claim tokens |
| `earlier_token_only` | Synthetic | Exclude claim positions but include prompt/rationale | Perfect coupling in hard overlap |
| `claim_only_pooling` | Synthetic, FEVER, code | Wrong-span or shortcut control | Fails in synthetic, succeeds in FEVER due to label access, partially succeeds in code hidden states |
| `random_consistency` / random labels | FEVER, KataGo, code, LeanCheck | Sanity check for label signal | Typically collapses or stays weak on coupling, validating real-label signal |
| `wrong_span` | LeanCheck | Pool theorem span instead of rationale/proof | Near chance; behaves as control |
| `proof_only` | LeanCheck | Formal proof span control | Perfectly follows proof/original rather than rationale swaps |
| `evidence_only_strict` | FEVER | Mask non-evidence states in consistency path | Similar to evidence-only; more evidence-sensitive but low raw accuracy |
| `matched counterfactuals` | FEVER | Hold claim lexically similar while swapping evidence label | Cleanly shows evidence sensitivity for evidence-only variants |
| hidden-state intervention | Synthetic, LeanCheck | Causal test beyond decodability | Stronger evidence for consistency-trained spans |
| surface bottleneck | Code | Force coupling through LM output distributions | Weaker than hidden-state coupling |
| no-claim-to-claim attention | Code | Prevent claims sharing information with other claims | Maintains high coupling |
| claims-from-explanation-only | Code | Force claim path through explanation hidden states | Maintains coupling, may hurt claim accuracy |

## Proposed or design-only threads

Several threads produced important designs but no final metrics in accessible artifacts:

- Deep KataGo variations: top-3 principal variations, 5-8 plies deep, with winrates and ownership summaries. This was framed as a way to make commentary richer and more AI4Math-relevant.
- Go RL / KataGo tool use: using KataGo as a verifier/reward source for RLVR/GRPO, especially after SFT reaches a ceiling.
- Adversarial code coupling: train on mismatched explanation-claim pairs and test whether claim pressure can override corrupted prose.
- 10k/50k/180k Go scaling: recommended because the 1k/200 GPT-OSS run showed class collapse and overfitting.
- LoGos / hybrid Go architecture: explored as a stronger board-understanding substrate, but not represented by a completed metrics table in the accessible artifacts.

## Paper-facing implications

The strongest current paper claim is not “consistency loss improves language quality everywhere.” The evidence supports a sharper and more defensible claim:

> Inline claims plus consistency loss can make natural-language rationale spans carry verifier-relevant information across controlled, formal, and domain-oracle settings. However, decodable coupling is not sufficient for human-usable explanations; diagnostic controls, counterfactuals, and causal interventions are necessary to distinguish coupling from shortcut learning.

Main-paper priority:

1. Synthetic: include scaled convergence, hidden-state patching, claim-only control, and overlapping-vocab hard run.
2. LeanCheck: include as the clean formal-verifier result, with proof-only/rationale-only separation.
3. KataGo: include as the dense domain-verifier result.
4. Code: include as the diagnostic negative showing coupling does not imply usable prose.
5. FEVER: compress or move to appendix because it is weaker and less venue-relevant.

Appendix priority:

- Full per-variant tables for synthetic hard, LeanCheck, FEVER, KataGo, code V1/V2/rich ontology, and Go GPT-OSS.
- Manual code review examples showing structured claims and prose contradictions.
- Go per-claim confusion summaries, especially the epoch-4 class-collapse patterns.
