# Perplexity Thread Extraction Report
Generated: 2026-05-17

---

## LINK 1: https://www.perplexity.ai/search/61c39267-1363-4a56-89a7-3009feaceb8b
**Title/Topic:** Q-Former Architecture for Go/KataGo — Cross-Attention & Token Compression  
**Accessible:** ✅ Yes (after closing login modal)

### Experiment Purpose
Conceptual/design discussion about using a Q-Former (query-based cross-attention module from BLIP-2) to compress and translate KataGo trunk features into a compact token set usable by an LLM. No standalone experiment is run here — this is architectural Q&A.

### Code/Data Described
- KataGo trunk: produces feature tensor of shape `[B, 361, 384]` — one 384-dim vector per board intersection on a 19×19 board
- Q-Former compresses 361 trunk positions into 32 fixed query tokens
- Flow: `KataGo trunk → Q-Former cross-attention → 32 tokens → projection → LLM`

### Architecture Details
- **Board Encoder**: 22 spatial + 19 global KataGo input channels
- **Q-Former role**: acts as learned adapter (bridge) between spatial board features and LLM token space
- **Mechanism**: Q-Former starts with N learnable query vectors; queries cross-attend to frozen KataGo trunk; produces fixed-size token set
- **Cross-attention**: Q (learnable queries) × K,V (KataGo trunk positions) → weighted combination → updated query representations
- Analogy: "32 analysts each reading 361 board cells, each asking a different question"
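
The cross-attention step above can be sketched in a few lines. This is a toy, single-head version written under stated assumptions: it omits the learned Q/K/V projection matrices and any LLM projection layer, uses random vectors in place of real KataGo trunk features, and shrinks the channel dimension from 384 to 8 so it runs quickly in pure Python. The function name `cross_attention` is illustrative, not taken from any library.

```python
import math
import random

def cross_attention(queries, keys_values):
    """Single-head cross-attention: each query returns a softmax-weighted mix of the inputs."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every board position.
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim) for kv in keys_values]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted combination of position vectors (keys double as values in this toy).
        outputs.append([sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)])
    return outputs

rng = random.Random(0)
N_POSITIONS, N_QUERIES, DIM = 361, 32, 8  # DIM shrunk from 384 to keep the toy fast
trunk = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_POSITIONS)]   # stand-in for frozen trunk features
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_QUERIES)]   # stand-in for learnable queries
tokens = cross_attention(queries, trunk)
print(len(tokens), len(tokens[0]))  # 32 8
```

Whatever the board size, the output always has one vector per query, which is what gives the LLM a fixed-length prefix.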

### Variants/Ablations
None run — this is theoretical discussion

### Numeric Results
- Trunk shape: `[B, 361, 384]` (batch × intersections × channels)
- Output tokens: 32 (configurable)
- Board size: 19×19 = 361 intersections

### Qualitative Findings
- Q-Former is argued to be superior to a linear projector because it actively selects and reorganizes information (no experiment in this thread tests the claim)
- Each query token can specialize (local tactics, global territory, etc.)
- Too many tokens = longer prompt + harder job for LLM

### Limitations
- Conceptual only; no experiments run
- Q-Former doesn't produce clean symbolic concepts — produces grounded interpretations

### Artifacts/Files Mentioned
- None

---

## LINK 2: https://www.perplexity.ai/search/7c29e5d3-bda1-4894-8662-72a53f8cade9
**Title/Topic:** GPT-5.5 Pro + KataGo Tool Use — Verifiability Assessment & Full Architecture for KataGo-Grounded Transformer  
**Accessible:** ✅ Yes

### Experiment Purpose
1. Assessing whether GPT-5.5 Pro with KataGo tool use can provide accurate, verifiable explanations for Go positions
2. Full architecture/build guide for a KataGo-grounded transformer that generates NL Go analysis (Mastermind-Go-style)
3. Discussion of hidden-state intervention and causal patching experiments to validate claim-consistency coupling

### Code/Data Described

**Training data (4-task curriculum from Mastermind-Go):**
- Task 1 (State-Transition): 150,000+ examples from KataGo self-play — `(board_state, move) → next_board_state`
- Task 2 (KataGo Position Analysis): 138,693 samples from 36 KataGo trajectories — `(board_state, candidate_moves, komi) → {ownership_map, score_lead, win_probability}`
- Task 3 (NL Explanation): Only 1,503 book samples from Go textbooks
- Task 4 (Integrated): Chain-of-thought combining all above

**KataGo JSON output fields used:**
- `rootInfo.winrate`, `rootInfo.scoreLead`, `moveInfos[0].move`, `moveInfos[].order`, `moveInfos[].visits`, `moveInfos[].pv`, `moveInfos[].prior`, `ownership[]`, `ownershipStdev[]`
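
As a rough illustration of how these fields might be read from one analysis response, the sketch below pulls out the root evaluation and the preferred move. The helper name `summarize_analysis` and the example values are made up for illustration; only the field names come from the list above.

```python
def summarize_analysis(response):
    """Extract the headline numbers from one KataGo analysis response (a parsed JSON dict)."""
    root = response["rootInfo"]
    # Sort by `order` so index 0 is KataGo's preferred move.
    moves = sorted(response["moveInfos"], key=lambda m: m["order"])
    best = moves[0]
    return {
        "win_probability": root["winrate"],
        "score_lead": root["scoreLead"],
        "top_move": best["move"],
        "top_move_pv": best["pv"],
        "top_move_visit_share": best["visits"] / sum(m["visits"] for m in moves),
    }

# Hand-written example response; the values are illustrative, not real engine output.
example = {
    "rootInfo": {"winrate": 0.73, "scoreLead": 5.2},
    "moveInfos": [
        {"move": "D4", "order": 1, "visits": 100, "prior": 0.2, "pv": ["D4", "Q16"]},
        {"move": "Q4", "order": 0, "visits": 300, "prior": 0.5, "pv": ["Q4", "D16", "Q16"]},
    ],
}
summary = summarize_analysis(example)
print(summary["top_move"], summary["top_move_visit_share"])  # Q4 0.75
```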

**LoRA hyperparameters (from Mastermind-Go):**
```python
# LoRA: r=32, alpha=64
# Learning rate: 5e-5 with cosine decay
# Gradient clipping: norm=1.0
# Dropout: 0.1
# Hardware: 8x A100 GPUs
# Optimizer: AdamW
```

**Training loss:** `L_θ = -(1/N) Σ_i log p_θ(Y_i | X_i)` (next-token cross-entropy over answer tokens only)
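
A minimal sketch of "cross-entropy over answer tokens only", assuming the per-token target probabilities are already available and that a 0/1 mask marks the answer positions. The function name `answer_only_nll` is hypothetical.

```python
import math

def answer_only_nll(token_probs, answer_mask):
    """Mean negative log-likelihood over positions where answer_mask is 1."""
    losses = [-math.log(p) for p, keep in zip(token_probs, answer_mask) if keep]
    return sum(losses) / len(losses)

probs = [0.9, 0.2, 0.5, 0.25]  # model probability assigned to each target token
mask = [0, 0, 1, 1]            # only the last two tokens belong to the answer
loss = answer_only_nll(probs, mask)
print(round(loss, 4))  # 1.0397
```

The prompt tokens (mask 0) contribute nothing, so a poor probability on a prompt token such as the 0.2 above does not affect the loss.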

**Structured output schema:**
```json
{
  "win_probability": 0.73,   // KataGo API — not generated
  "score_lead": 5.2,         // KataGo API — not generated
  "top_move": ...,           // KataGo API — not generated
  "territories": {...},
  "explanation": ...         // LLM-generated, must reference above
}
```

**Board text encoding:** `#` = black, `o` = white, `•` = empty; recent moves tagged with index (e.g., `#(3)`, `o(1)`)
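
The encoding can be sketched as follows. The input format (a dict of coordinates to colors, plus a dict of recent-move indices) and the space-separated layout are assumptions for illustration; the thread only specifies the three symbols and the `(n)` tag.

```python
def render_board(stones, size, recent=None):
    """stones: {(row, col): 'B' or 'W'}; recent: {(row, col): move index} for tagged moves."""
    recent = recent or {}
    symbols = {"B": "#", "W": "o"}
    rows = []
    for r in range(size):
        cells = []
        for c in range(size):
            # Empty intersections fall through to the bullet symbol.
            cell = symbols.get(stones.get((r, c)), "•")
            if (r, c) in recent:
                cell += f"({recent[(r, c)]})"
            cells.append(cell)
        rows.append(" ".join(cells))
    return "\n".join(rows)

text = render_board(
    {(0, 0): "B", (1, 1): "W", (2, 1): "B"},
    size=3,
    recent={(1, 1): 1, (2, 1): 2},
)
print(text)
# # • •
# • o(1) •
# • #(2) •
```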

**Activation patching (hidden-state intervention) pseudocode:**
```python
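def make_patch_hook(rationale_pos_indices, clean_rationale_acts):
    """Build a hook that overwrites rationale-position activations with clean ones."""
    def patch_hook(acts):
        # acts: [batch, seq_len, hidden]; clone so the caller's tensor is not edited in place
        acts = acts.clone()
        acts[:, rationale_pos_indices, :] = clean_rationale_acts
        return acts
    return patch_hook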
```

### Variants/Ablations (Architecture table)
| Variant | Description |
|---|---|
| GPT-5.5 + KataGo MCP (prompt only) | Baseline |
| SFT on KataGo analysis (Tasks 1–2) | |
| Full Mastermind-Go 4-task SFT | |
| Full SFT + DPO with dan-level feedback | |
| Hybrid CNN encoder (frozen KataGo trunk) + LLM decoder | Strongest |

### Numeric Results
- Mastermind-Go Task 1: **99.44% accuracy** (single-task), **96.08%** (multi-task)
- Mastermind-Go Task 2: Score MAE **1.74 points**
- Task 3: Only 1,503 book samples (severely data-starved)
- GPT-5.5 Pro: **82.7%** on Terminal-Bench 2.0, **78.7%** on OSWorld-Verified (released April 23, 2026)
- General LLMs: **below 35% accuracy** on next-move prediction vs KataGo top-10

### Qualitative Findings
- GPT-5.5 Pro with KataGo can provide accurate numerical claims (win rates, score leads, top moves) that trace back to KataGo
- Natural language conceptual explanations are NOT verifiable by KataGo — they come from LLM training data
- DPO preferred over RLHF for alignment; preferred explanations must accurately reflect KataGo evaluation
- Catastrophic forgetting: mixing ~10% general instruction data prevents loss of date/spatial reasoning

### Failure Modes
- LLM hallucination structurally unavoidable
- Board representation bottleneck (19×19 spatial-temporal relationships hard to tokenize)
- KataGo can give over- or underconfident estimates and may switch its preferred move at higher search budgets (suggested mitigation: `rootNumSymmetriesToSample: 8`)
- Task 3 is severely data-starved (1,503 samples; need 50,000+)

### Artifacts/Files Mentioned
- Mastermind-Go: `https://huggingface.co/datasets/OpenDILabCommunity/MasterMind`
- KataGo Analysis Engine docs: `https://github.com/lightvector/KataGo/blob/master/docs/Analysis_Engine.md`

---

## LINK 3: https://www.perplexity.ai/spaces/go-7Bj2MY2ATv6htB0IVU2VSw/bf412963-1168-4efa-9aae-197ff63d6ea3
**Title/Topic:** Claim Consistency Coupling Experiment in PyTorch (Spaces thread, Apr 28)  
**Accessible:** ✅ Yes

### Experiment Purpose
Build and run a minimal PyTorch experiment to validate **claim-consistency coupling** in a small decoder-only transformer. Tests whether an auxiliary consistency loss head forces the model to encode latent state information in its rationale hidden states, which then predicts claims.

### Code/Data Described

**Synthetic dataset:**
- 8–16 configurable latent states
- Each latent state: 3–5 paraphrased rationale templates + 1 deterministic claim label
- Returns: token IDs, next-token targets, rationale mask, full-sequence mask, earlier-token mask, latent state label
- Includes shuffled-pairing control (mismatched rationale-claim pairs)
- Smoke run: 512 training, 128 eval, 64 counterfactual samples; 8 latent states; 5 epochs; CPU-only

**Model:** Small GPT-2-style decoder-only Transformer in plain PyTorch
- Next-token LM loss
- Auxiliary mean-pooled consistency classification head (configurable over rationale tokens, full sequence, or earlier tokens)
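
The pooling step that feeds the consistency head can be sketched as a masked mean. This covers only the pooling; the classifier on top is omitted, and `masked_mean_pool` is an illustrative name rather than the one used in `claim_consistency_experiment.py`.

```python
def masked_mean_pool(hidden_states, mask):
    """Average the hidden vectors at positions where mask is 1."""
    selected = [h for h, keep in zip(hidden_states, mask) if keep]
    if not selected:
        raise ValueError("mask selects no positions")
    dim = len(selected[0])
    return [sum(h[d] for h in selected) / len(selected) for d in range(dim)]

hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]  # one vector per token
rationale_mask = [0, 1, 1, 0]
full_mask = [1, 1, 1, 1]
pooled_rationale = masked_mean_pool(hidden, rationale_mask)
pooled_full = masked_mean_pool(hidden, full_mask)
print(pooled_rationale, pooled_full)  # [4.0, 3.0] [4.5, 3.75]
```

Switching between the rationale-only, full-sequence, and earlier-token variants then amounts to passing a different mask.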

**FEVER GPU experiment:**
- Dataset: `copenlu/fever_gold_evidence`
- Base model: pretrained HuggingFace GPT-2 (`GPT2LMHeadModel`, `GPT2TokenizerFast`)
- Sequence format: `[BOS] <evidence_passage> [SEP] <claim> [LABELSEP] <label_token> [EOS]`
- Special tokens added: `[BOS]`, `[EOS]`, `[PAD]`, `[SEP]`, `[LABELSEP]`, `[SUPPORTS]`, `[REFUTES]`, `[NEI]`
- Defaults: 50,000 train / 5,000 eval samples, max_seq_len=256, batch_size=16, 5 epochs, lr=5e-5, consistency_loss_weight=0.5, freeze_lower_layers_epochs=1
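
To make the sequence format and pooling spans concrete, here is a sketch that builds one example and the two span masks. It assumes whitespace tokenization in place of `GPT2TokenizerFast`, and the mapping from FEVER label strings to label tokens is a guess at how the script does it.

```python
# Assumed mapping from FEVER label strings to the added special tokens.
LABEL_TOKENS = {"SUPPORTS": "[SUPPORTS]", "REFUTES": "[REFUTES]", "NOT ENOUGH INFO": "[NEI]"}

def build_example(evidence, claim, label):
    """Return tokens plus 0/1 masks for the evidence span and the claim span."""
    tokens = ["[BOS]"] + evidence.split() + ["[SEP]"] + claim.split()
    tokens += ["[LABELSEP]", LABEL_TOKENS[label], "[EOS]"]
    sep, labelsep = tokens.index("[SEP]"), tokens.index("[LABELSEP]")
    # Evidence tokens sit strictly between [BOS] and [SEP].
    evidence_mask = [1 if 0 < i < sep else 0 for i in range(len(tokens))]
    # Claim tokens sit strictly between [SEP] and [LABELSEP].
    claim_mask = [1 if sep < i < labelsep else 0 for i in range(len(tokens))]
    return tokens, evidence_mask, claim_mask

tokens, evidence_mask, claim_mask = build_example(
    "Paris is the capital of France .", "Paris is in France .", "SUPPORTS"
)
print(len(tokens), sum(evidence_mask), sum(claim_mask))  # 17 7 5
```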

**CLI command:**
```bash
python run_fever_pretrained_gpu.py --require-gpu \
  --model_name gpt2 \
  --train_samples 50000 \
  --eval_samples 5000 \
  --max_seq_len 256 \
  --epochs 5 \
  --batch_size 16 \
  --lr 5e-5 \
  --consistency_loss_weight 0.5 \
  --freeze_lower_layers_epochs 1 \
  --variants no_consistency_loss evidence_only_pooling full_sequence_pooling claim_only_pooling \
  --output_csv results_fever_pretrained_gpu.csv
```

### Variants/Ablations

**Synthetic experiment (4 variants):**
| Variant | Pooling Region |
|---|---|
| no_consistency_loss | No auxiliary head |
| rationale_only_pooling | Pool rationale tokens only |
| full_sequence_pooling | Pool full sequence |
| earlier_token_only | Pool only earlier tokens |

**FEVER GPT-2 (4 variants):**
| Variant | Description |
|---|---|
| no_consistency_loss | Baseline, no auxiliary loss |
| evidence_only_pooling | Pool evidence tokens (before [SEP]) |
| full_sequence_pooling | Pool full sequence |
| claim_only_pooling | Pool claim tokens (between [SEP] and [LABELSEP]) |

### Numeric Results (Smoke test — CPU, 8 latent states, 5 epochs)

**Key finding:** Tables had empty cells in rendered text, but the report stated:
- All consistency-trained variants: **perfect classifier accuracy (100%)** when decoding claims from rationale-pooled hidden states
- `no_consistency_loss`: **75% generation accuracy** despite near-zero classifier accuracy (**3.9%**)
- full_sequence pooling: **93.8% generation accuracy** (highest)
- rationale_only: **63% generation accuracy**
- earlier_token_only: **62% generation accuracy**
- Counterfactual swap: all consistency-trained variants — classifier follows swapped rationale
- Shuffled-pairing control: all models ~8–10% (near chance)

**Validation (no GPU workspace):**
| Check | Result |
|---|---|
| Python compile: fever_pretrained_gpt2_experiment.py | OK |
| Python compile: run_fever_pretrained_gpu.py | OK |
| torch.cuda.device_count() | 0 |
| --require-gpu --smoke-test | Exits with error (expected) |

**Workspace:** `torch.__version__ = 2.11.0+cu130` but `torch.cuda.is_available()` = False. No GPU training run.

**Hardware estimates:**
- A10/L4: enough for 4 variants × 5 epochs, GPT-2 base
- gpt2-medium: ~3× slower per step than gpt2

### Qualitative Findings
- Consistency loss successfully forces claim identity into pooled hidden states
- Counterfactual swap test: classifier follows swapped rationale rather than original
- Shuffled-pairing control drops to near chance — coupling is specific to rationale content
- `full_sequence` variant is strongest; `rationale_only` and `earlier_token_only` need more training

### Limitations
- No GPU — full FEVER run not executed
- Only CPU smoke test with tiny synthetic data

### Artifacts/Files Created
- `claim_consistency_coupling_experiment.ipynb` — runnable notebook
- `claim_consistency_coupling_experiment_executed.ipynb` — pre-executed with outputs
- `claim_consistency_experiment.py` — full implementation module
- `results_comparison.csv` — smoke-test comparison table
- `README.md`
- `fever_pretrained_gpt2_experiment.py`
- `run_fever_pretrained_gpu.py`
- `README_FEVER_GPU.md`

**Experiment progression noted:**
1. `claim_consistency_experiment.py` — scratch synthetic
2. `fever_claim_consistency_experiment.py` — FEVER scratch
3. `fever_pretrained_gpt2_experiment.py` — FEVER pretrained GPT-2

---

## LINK 4: https://www.perplexity.ai/search/7eae3017-9f02-4a34-bd50-742df95924e6
**Title/Topic:** Reading KataGo Experiment Results — Interleaved Claims + Consistency Loss; Code Generalization Design  
**Accessible:** ✅ Yes

### Experiment Purpose
Summarizing results from KataGo experiments with interleaved claims and consistency loss; designing a code generalization experiment.

### Key Results Summary (from memory context)

**FEVER Proxy Results:**
- **100% accuracy** where rationale representations fully determine scalar claim bins
- **100% faithfulness** to original rationales in counterfactual swaps (0% swap influence)
- Unchanged language model-level rationale quality
- `claim_only_pooling` negative control validated design: pooling wrong token span destroys coupling mechanism

**Generated Rationale + Scalar Claim Results:**
- Synthetic experiment (model generates both rationale text and structured scalar claims)
- **Consistency loss creates robust semantic coupling between generated prose and verifiable claims**
- Model learned to encode latent state information in generated rationales

**KataGo Win-Probability Results:**
- Clear positive results with oracle supervision and consistency loss
- (See Link 5 for detailed metrics)

**Core mechanism:** "LMs produce inline verifiable claims sharing hidden state representations with prose, graded by domain oracle"

### Code Generalization Experiment Design
To prove generalization to code:
- Replicate 3-component architecture with code domain
- Components: code generation component, claim component, verifier
- Domain oracle: execution-based verification of claims about code behavior

### Qualitative Findings
- The validated pattern is domain-agnostic
- Architecture: same 3-component (rationale + claim + verifier) applies across Go/FEVER/code

---

## LINK 5: https://www.perplexity.ai/search/e75fe5d3-5fa5-484f-8a3d-20ad8a9ca0af
**Title/Topic:** Analyzing Claim Consistency Coupling PyTorch Results; KataGo Win-Prob Experiment Metrics  
**Accessible:** ✅ Yes

### Experiment Purpose
Detailed analysis of the claim-consistency coupling experiment results (uploaded files), including strengths/weaknesses and key metric discussion. Also covers the KataGo win-probability experiment.

### Data Files Referenced (uploaded by user)
- `results_comparison.csv`
- `README.md`
- `claim_consistency_experiment.py`
- `claim_consistency_coupling_experiment_executed.ipynb`
- `katago_winprob_20260429.csv`
- `katago_winprob_20260429.md`

### Variants (Synthetic experiment)
- `no_consistency_loss`
- `rationale_only`
- `full_sequence`
- `earlier_token_only`

### Variants (KataGo win-prob experiment)
- `lm_only`
- `no_consistency_loss`
- `rationale_only`
- `full_consistency`
- `random_consistency` (random bucket labels — sanity check)

### Numeric Results — Synthetic Experiment
| Variant | Gen Acc | Cls Acc | Cfact Follows Swap | Shuffled Acc |
|---|---|---|---|---|
| no_consistency_loss | 75% | 3.9% | ? | ~8–10% |
| rationale_only | ~63% | 100% | 100% | ~8–10% |
| full_sequence | **93.8%** | 100% | 100% | ~8–10% |
| earlier_token_only | ~62% | 100% | 100% | ~8–10% |

### Numeric Results — KataGo Win-Prob Experiment (claim-bin accuracy)
| Variant | Claim-Bin Accuracy |
|---|---|
| lm_only | 0.016 (1.6%) |
| no_consistency_loss | 0.251 (25.1%) |
| rationale_only | **0.787 (78.7%)** |
| full_consistency | **0.811 (81.1%)** |
| random_consistency | 0.160 (16.0%) |

**Key pattern:** Baselines (lm_only, no_consistency_loss, random_consistency) stay low at 1.6–25.1%. Consistency-trained variants jump to 78.7–81.1%.

**`full_consistency` model has two heads:**
1. **Win-rate number head** — reads hidden state at `[CLAIM]` position (after reading board + comment); predicts exact KataGo win rate number
2. **Bucket-from-comment head (consistency head)** — reads only comment span tokens (between `[RAT]` and `[CLAIM]`); predicts coarse bucket (clearly behind/close/ahead)

**`random_consistency`** control: same architecture but random bucket labels fed to consistency head → does NOT achieve high accuracy on real buckets → confirms real consistency variants work because of actual alignment, not trivial artifact

### Experimental Flaws Identified
1. **Very small sample size**: only 512 training, 128 eval, 64 counterfactual samples
2. **Limited epochs**: only 5 epochs — `rationale_only` and `earlier_token_only` may need more
3. **Synthetic task doesn't prove real-world coupling** — non-overlapping token ranges per latent state with deterministic claim tokens
4. **No gradient/hidden-state intervention test** — counterfactual shows prediction follows swapped rationales, but no causal patching on generative model
5. **Weak baseline**: `no_consistency_loss` achieves 75% generation accuracy despite near-zero cls accuracy — task may be too simple
6. **Missing ablation**: no test of consistency head trained on wrong pooling spans (e.g., pooling claim tokens instead of rationale)

### What to Improve
- 10x+ samples (5,000+ training), 20–50 epochs, add hidden-state intervention test, test wrong pooling controls

### Qualitative Findings
- The experiment de-risked the core architectural bet
- Consistency head with causal masking can couple rationale representations to oracle-verified scalar claims
- Validated in domain-agnostic way

---

## LINK 6: https://www.perplexity.ai/search/7a631524-50dc-4085-8ee7-4a683ccd1362
**Title/Topic:** Causal Masking for Board/Explanation/Claim Layout; Multi-Head Claims; Dataset Scaling  
**Accessible:** ✅ Yes

### Experiment Purpose
Design of attention masking for a Go LLM with board-prefix → explanation → claims layout. Includes real experiment results on multi-head claim training on Go data (9k/1k positions), with per-head metrics showing class collapse.

### Code/Data Described

**Causal mask design (B=board, E=explanation, C=claim positions):**
```
M[q,k]:
  q ∈ B: allow k ∈ B with k ≤ q
  q ∈ E: allow k ∈ B ∪ E with k ≤ q  
  q ∈ C: allow k ∈ E (only explanation tokens)
```
Loss: `L = L_explanation + λ_c * L_claims`
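
The mask rules can be written out directly as a boolean matrix. The sketch below assumes each position carries a segment tag (`"B"`, `"E"`, or `"C"`) and follows the rules exactly as stated, so claim positions see explanation tokens only and not themselves; `build_mask` is an illustrative name.

```python
def build_mask(segments):
    """segments: list like ['B', 'B', 'E', 'E', 'C']; mask[q][k] is True if q may attend to k."""
    n = len(segments)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if segments[q] == "B":
                mask[q][k] = segments[k] == "B" and k <= q
            elif segments[q] == "E":
                mask[q][k] = segments[k] in ("B", "E") and k <= q
            else:  # claim position: explanation tokens only
                mask[q][k] = segments[k] == "E"
    return mask

layout = ["B", "B", "E", "E", "C"]
mask = build_mask(layout)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# 10000
# 11000
# 11100
# 11110
# 00110
```

One practical caveat: a claim row would have nothing to attend to if the sequence contained no explanation tokens, so an implementation needs to guard against all-masked rows before the softmax.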

**Dataset:** 9,000 train / 1,000 eval Go positions with human commentary and KataGo-verified claims

**Training setup (for ~7–8B model):**
- LoRA/QLoRA (r=8–16)
- Effective batch size 64–128
- 3–5 epochs initially with early stopping on macro-F1
- LR: 1e-4 to 5e-5 for 7–8B LoRA
- Warmup: 3–5% steps, cosine/linear decay
- Dropout: ~0.1

### Multi-Head Claim Results (on 1k positions)

| Head | Accuracy | Macro-F1 | Notes |
|---|---|---|---|
| global_contestedness | 0.865 | 0.765 | Working well |
| win_prob_bin | 0.315 | 0.177 | Collapses to bin 8; several bins never predicted |
| score_lead_bin | 0.550 | 0.171 | Mostly LEAD_CLOSE; never big leads |
| phase_estimate | 0.755 | 0.326 | Almost everything PHASE_OPENING; mid/late get F1 0.0 |
| Region/best-move heads | various | near 0 for many classes | Strong bias to bottom/top and CTRL_BOTTOM_BIG |

### Diagnosis
- Class distribution skew → majority baseline does well on accuracy
- Model undertrained or weak signal → sits on majority default bins
- Macro-F1 punishes this → much lower than accuracy
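
The gap between accuracy and macro-F1 under class collapse is easy to reproduce on made-up data. In the sketch below the label distribution is invented for illustration (it is not the experiment's data), and the model is assumed to predict the majority class every time.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Invented skewed labels; the "model" always predicts the majority class.
y_true = ["opening"] * 75 + ["middle"] * 15 + ["late"] * 10
y_pred = ["opening"] * 100
print(round(accuracy(y_true, y_pred), 3), round(macro_f1(y_true, y_pred), 3))  # 0.75 0.286
```

Accuracy rewards the majority guess, while the two never-predicted classes each contribute an F1 of 0 to the macro average.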

### Recommendations
1. Scale to 10k+ positions with balanced class coverage
2. Class-weighted cross-entropy or focal loss (γ≈1–2) for skewed heads
3. Simplify some heads: phase → binary/ternary; win/score bins → coarser or regression head
4. Check region representations — model can't distinguish board regions from text descriptions
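
For the second recommendation, a per-example focal loss with optional class weights might look like the sketch below. It uses the standard form `-w_t (1 - p_t)^γ log p_t`; the function name and the example probabilities are illustrative.

```python
import math

def focal_loss(probs, target, gamma=2.0, class_weights=None):
    """Focal loss for one example; gamma=0 and no weights reduces to cross-entropy."""
    p_t = probs[target]
    weight = class_weights[target] if class_weights else 1.0
    # (1 - p_t)^gamma shrinks the loss on examples the model already gets right.
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss([0.9, 0.05, 0.05], target=0)  # confident and correct
hard = focal_loss([0.9, 0.05, 0.05], target=1)  # rare class, badly predicted
print(hard > easy)  # True
```

With a majority-collapsed head, most of the gradient then comes from the rare-class examples rather than from the many easy majority ones.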

### Scaling Analysis
- More data (1k → 10k → 50k+): strongly expected to improve rare class prediction
- More epochs (same 1k): eval macro-F1 already wobbling; further epochs mostly overfit majority classes
- **Key lever: more and better-balanced data, not just more epochs**

### Qualitative Findings
- Inline/modular claims (attached to local explanation spans) preferred over monolithic claims section with ~20 claim types
- Structure: 3–6 explanation units per position, each gets 0–3 claims from 20-claim inventory
- Mask where claim can attend only to attached explanation unit forces local grounding
- `global_contestedness` working well (strong anchor); other heads need data-scale and rebalancing

### Limitations
- Actual experimental numbers shown are from 1k dataset with early epochs — acknowledged as preliminary
- Region heads severely limited by model's inability to distinguish board regions

---

## LINK 7: https://www.perplexity.ai/search/49934aa4-b969-4956-895f-78e502ee8e35
**Title/Topic:** RL with Reasoning LM to Improve LoGos Explanation Quality  
**Accessible:** ✅ Yes (short thread)

### Experiment Purpose
Addressing the limitation that in LoGos each explanation token receives identical gradient (reward is outcome-level, not per-token). Proposes RL with a reasoning LM where the model outputs top move based on its explanation of the position.

### Code/Data Described
No specific code. References KataGo evaluation as reward oracle.

### Core Mechanism
- RL creates direct selection pressure: explanations that lead to strong moves get rewarded
- Unlike LoGos: this method makes each word in the explanation instrumental to the outcome
- Fine-grained reward at sentence/sub-sentence level: specific reasoning steps associated with their contribution to move quality

### Qualitative Findings
- RL rewards model based on move quality derived from explanation → explanation tokens become causally important
- Model learns that vague or incorrect reasoning produces weak moves and lower rewards; precise tactical analysis yields better moves and higher rewards from KataGo evaluation
- Internal representations update through gradient optimization: individual neurons and directions organize around specific Go concepts (territory, atari, influence)
- KL divergence penalty preserves original model's concept coverage

### Limitations
- KL divergence penalty prevents model from completely abandoning LLM knowledge
- No specific numeric results cited in this thread (references to prior work)

### Artifacts/Files
- Follow-up artifact: "How to fix LoGos sparse rewards — top RLVR methods vs KataGo win rates compared" (Computer task)

---

## LINK 8: https://www.perplexity.ai/search/6db52913-3dcd-4dac-9929-4112a9c1e28a
**Title/Topic:** LoGos Reward Limitation, Prefix-Locked Explanation Bottleneck, and Second RL Run Design  
**Accessible:** ✅ Yes

### Experiment Purpose
Deep dive into: (1) precisely what LoGos reward limitation is, (2) RL-with-reasoning-LM design, (3) how it forces explanation quality, (4) internal representation mechanics. Then discusses whether a second RL run on public LoGos model (with modified attention mask) makes sense.

### Key Conceptual Content

**LoGos reward limitation:**
- Uses **segmented, outcome-level reward** tied to predicted move + win-rate
- Piecewise reward function scores whole response by whether final move is in KataGo's top-10 candidates
- Every token in reasoning chain — sentences about influence, shape, ko, ladders — receives **identical gradient signal** scaled only by group-relative advantage Â_i from GRPO
- Commentary words are NOT in reward's support → unfaithful reasoning pathology
- LoGos human evaluators: only **55.6% of explanations were correct** even when move predictions were right
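
The "identical gradient signal" point can be illustrated with the usual group-relative advantage, `Â_i = (r_i - mean) / std`. The sketch assumes population standard deviation and a small epsilon; real implementations may differ on both, and the rewards shown are invented.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def per_token_advantages(advantage, num_tokens):
    # Outcome-level reward: every token in the response shares one advantage value.
    return [advantage] * num_tokens

rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. whether each rollout's final move was in the top-10
advs = group_relative_advantages(rewards)
token_advs = per_token_advantages(advs[0], num_tokens=5)
print(len(set(token_advs)))  # 1
```

Because the per-token list contains a single distinct value, a correct sentence and an incorrect sentence in the same rollout are pushed in the same direction.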

**Prefix-locked explanation design:**
```
Full sequence: board description + explanation tokens + [MOVE]
```
- Model NOT allowed to write move until after explanation is finished
- Move logits are a function of the explanation, not produced in parallel
- RL reward flows back through move → through explanation tokens → into explanation content

**Bottleneck test:** "If you surgically delete explanation tokens and only keep board embedding + move head, does the system still play well? If yes → explanation was decorative."

**One decoder enforcement:**
```python
# WRONG (LoGos-style side channel):
move_logits = move_head(board_embedding)

# CORRECT (prefix-locked):
move_logits = lm.forward(board_tokens + explanation_tokens)
```

**Second RL run on LoGos:**
- Proposed: same reward + attention mask forcing move to attend only to explanation tokens (not board)
- Analysis: this is a good step but insufficient alone because:
  1. Board still in context window; explanation tokens already attended to board → board info packed into hidden states
  2. Reward still outcome-only → doesn't penalize incorrect claims that don't hurt move
  3. Attention masking is brittle — model can compress board info into "summary" hidden state propagated through explanation tokens
- **Requires:** Strict architectural constraint at move time + no special board-only representation + optional light process reward signal

### Qualitative Findings
- LoGos: "board → internal board representation → move" and separately "board → explanation text" (move doesn't need explanation)
- Prefix-locked: "board text → explanation tokens → internal state after reading explanation → move"
- Both systems see board in prompt, but information flow at move time is different
- LoGos architecture does NOT have a "special encoder → move head" — board is also fed as text tokens, but the reward/gradient wiring allows shortcut

### Limitations
- Even prefix-locked design can still pack board info into hidden states that bypass human-readable explanation text
- Pure outcome-based RL hits faithfulness ceiling without explicit process reward

---

## LINK 9: https://www.perplexity.ai/search/eb31d5a7-705b-49b6-b096-392ed76720d7
**Title/Topic:** FEVER Tightened Run — Evidence-Strict Masking vs. Evidence-Only Pooling; Matched-Claim Counterfactuals  
**Accessible:** ✅ Yes

### Experiment Purpose
Running a diagnostic FEVER experiment with tightened evidence masking (`evidence_only_strict`) to test whether strict masking improves evidence sensitivity over previous weak evidence-only pooling.

### Code/Data Described

**Dataset:** `copenlu/fever_gold_evidence`  
**Model:** Pretrained GPT-2  
**CLI command:**
```bash
python modal_fever_run.py \
  --model_name gpt2 \
  --train_samples 50000 \
  --eval_samples 5000 \
  --max_seq_len 256 \
  --epochs 5 \
  --batch_size 16 \
  --lr 5e-5 \
  --consistency_loss_weight 0.5 \
  --freeze_lower_layers_epochs 1 \
  --seed 42 \
  --output_stem fever50k_tightened_diag \
  --variants no_consistency_loss,evidence_only_pooling,evidence_only_strict,claim_only_pooling,evidence_only_random_labels
```

### Variants
- `no_consistency_loss` (baseline)
- `evidence_only_pooling` (old)
- `evidence_only_strict` (new, stricter masking)
- `claim_only_pooling` (shortcut/negative control)
- `evidence_only_random_labels` (sanity check)

### Numeric Results (from `fever50k_rerun_20260428.csv` / `.md`)

| Variant | cls_acc | gen_acc | cfact_cls_swap | cfact_cls_orig | matched_swap | matched_orig |
|---|---|---|---|---|---|---|
| no_consistency_loss | ~0.53 | ~0.80 | — | — | — | — |
| evidence_only_pooling | ~0.44 | ~0.81 | ~0.46 | ~0.26 | ~0.46 | ~0.23 |
| evidence_only_strict | ~0.44 | ~0.81 | ~0.46 | ~0.26 | ~0.46 | ~0.23 |
| full/claim pooling | **mid-0.83s** | slightly improved | — | — | ~0.34 | ~0.44–0.46 |
| evidence_only_random_labels | **~0.23** (near random) | ~0.81 | ~0.22 | ~0.42 | — | — |

### Key Pattern
- `no_consistency_loss` already decent (~0.53 cls, ~0.80 gen)
- Full/claim pooling + consistency loss pushes cls_acc to mid-0.83s → real label-conditioning gain
- **Strict evidence-only does NOT significantly improve over old evidence-only pooling** — strict masking fails to help
- Evidence-only variants clearly more evidence-sensitive on **matched-claim counterfactuals** (matched_swap ~0.46) vs full/claim pooling (matched_swap ~0.34)
- Random-label control: cls_acc ~0.23, at or below the ~0.33 uniform-chance level for a 3-way task → confirms consistency head not trivially overfitting
- `gen_acc` stays ~0.81 for random labels → LM path largely independent of mis-trained evidence-only head

### Qualitative Findings / Conclusion
- **Negative result for pure evidence-local coupling on real language**: strict masking doesn't beat full/claim pooling
- However: evidence-only is genuinely evidence-sensitive (shown by matched-claim counterfactuals)
- FEVER result is real but not clean "evidence-only hidden states carry all label signal" story from synthetic

**Paper-ready paragraph conclusion:**
"On FEVER, consistency loss reliably strengthens coupling between evidence and label. However, even with strict masking, evidence-only pooling never reaches the accuracy of full/claim pooling; the model still performs better with access to the full sequence or claim-position context. The random-label control confirms the evidence-only metrics are not trivial artifacts."

**FEVER vs. Synthetic comparison sketch:**
| Setting | Synthetic | FEVER |
|---|---|---|
| (Sketch table — cell values not fully rendered) | rationale-pooled consistency → near-perfect | evidence-only → partial |

### Limitations
- Strict masking not sufficient: board/evidence info propagates through earlier tokens' hidden states
- Real text coupling (FEVER) is harder than synthetic coupling

### Artifacts/Files
- `fever50k_rerun_20260428.csv`
- `fever50k_rerun_20260428.md`
- `fever_pretrained_gpt2_experiment.py` (uploaded by user)
- `modal_fever_run.py`

---

## LINK 10: https://www.perplexity.ai/search/a1b9115c-06de-4fb1-8eb5-2bc9468ec18b
**Title/Topic:** LoGos Reward Deep Dive + Codex Pilot Spec: Tool-Augmented Two-Pass Go Reasoning with Counterfactual GRPO  
**Accessible:** ✅ Yes

### Experiment Purpose
Two-part: (1) Deep research into LoGos reward limitations and why RL+reasoning LM with top-5 PV rewards would improve explanation quality; (2) Full specification (v0.1) for a Codex pilot implementing tool-augmented two-pass Go reasoning with counterfactual GRPO.

### Codex Pilot Spec — Overview

**Scope:** 10k–50k position RL training, single node (2× A100 80GB)  
**Goal:** Validate two-pass structured-concept bottleneck with tool-augmented self-verification and counterfactual GRPO  
**Base model:** Qwen3-8B (non-thinking mode); fallback: Qwen2.5-7B-Instruct

**Three core ideas:**
1. **Two-pass bottleneck:** Pass 1 → structured concept schema (JSON); Pass 2 → PV predictions + evaluation (conditioned on Pass 1 schema)
2. **Tool-augmented self-verification:** Deterministic Go tools called during Pass 1 for board-engine verifiable facts
3. **Counterfactual GRPO:** After each rollout, perturb concept fields minimally, re-run Pass 2, reward concepts that causally affect Pass 2 output

### Concept Schema (26 fields, 3 tiers)

**Tier 1 — Board-engine verifiable (deterministic):**
| Field | Tool |
|---|---|
| Gk.liberties | count_liberties() |
| Gk.in_atari | is_in_atari() |
| Gk.size | get_group_size() |
| Gk.is_connected_to_edge | — |
| ladder_exists | check_ladder() |
| ladder_favor | check_ladder() |
| ko_active | is_ko_active() |
| capture_race_exists | check_capture_race() |
| capture_race_favor | check_capture_race() |
| net_exists | — |

**Tier 2 — boardmatcher-grounded (named human concepts):**
| Field | Tool |
|---|---|
| focus_move_pattern_name | nameMove() on candidate move |
| focus_shape_type | findPatternInMove() shape classification |
| joseki_deviation | — |

**Tier 3 — KataGo-grounded (oracle signal):**
| Field | Source |
|---|---|
| position_sharpness | rootInfo.rawStWrError |
| score_uncertainty | rootInfo.scoreStdev |
| game_phase | rootInfo.rawVarTimeLeft (bucketed) |
| best_move_utility | moveInfos[0].utility |
| utility_gap_1_2 | moveInfos[0].utility − moveInfos[1].utility |
| best_move_score_delta | moveInfos[0].scoreLead − rootInfo.scoreLead |
| focus_region_ownership_mean | ownership array summed over region |
| focus_region_ownership_uncertainty | ownershipStdev array summed over region |
| Gk.ownership_confidence | ownershipStdev for group stones |
| best_move_ownership_delta | moveInfos[0].ownership − rootInfo.ownership |
| global_black_winrate | rootInfo.winrate |
| global_black_score_lead | rootInfo.scoreLead |
| Gk.status | ownershipStdev threshold + ownership sign |
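
A few of the Tier 3 fields are simple arithmetic over one analysis response, sketched below. The sketch assumes `moveInfos` should be sorted by `order`, that per-move `utility` and `scoreLead` are present, and that win rates are reported from Black's perspective (which depends on engine configuration). The example numbers are invented.

```python
def tier3_fields(response):
    """Derive a subset of Tier 3 concept fields from a parsed KataGo analysis response."""
    root = response["rootInfo"]
    moves = sorted(response["moveInfos"], key=lambda m: m["order"])
    return {
        "global_black_winrate": root["winrate"],      # assumes Black-perspective reporting
        "global_black_score_lead": root["scoreLead"],
        "score_uncertainty": root["scoreStdev"],
        "best_move_utility": moves[0]["utility"],
        "utility_gap_1_2": moves[0]["utility"] - moves[1]["utility"],
        "best_move_score_delta": moves[0]["scoreLead"] - root["scoreLead"],
    }

# Invented values for illustration only.
response = {
    "rootInfo": {"winrate": 0.73, "scoreLead": 5.2, "scoreStdev": 12.0},
    "moveInfos": [
        {"move": "Q4", "order": 0, "utility": 0.30, "scoreLead": 6.0},
        {"move": "D4", "order": 1, "utility": 0.10, "scoreLead": 4.1},
    ],
}
fields = tier3_fields(response)
print(round(fields["utility_gap_1_2"], 2), round(fields["best_move_score_delta"], 2))  # 0.2 0.8
```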

### Tool Specifications
```python
def count_liberties(anchor: str) -> int: ...
def is_in_atari(anchor: str) -> bool: ...
def get_group_status(anchor: str) -> str: ...
def check_ladder(prey_anchor: str, chaser_anchor: str) -> dict: ...
def is_ko_active() -> bool: ...
def count_territory(region: str, color: str) -> int: ...
def check_capture_race(group1_anchor: str, group2_anchor: str) -> dict: ...
```
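
As one possible body for the first two tools, the sketch below counts liberties with a flood fill. It departs from the spec's signatures in two labeled ways: the board is passed explicitly as a list of strings using the `#`/`o`/`•` symbols, and the anchor is a `(row, col)` tuple rather than a string.

```python
def count_liberties(board, anchor):
    """board: list of equal-length strings of '#', 'o', '•'; anchor: (row, col) of any stone in the group."""
    rows, cols = len(board), len(board[0])
    color = board[anchor[0]][anchor[1]]
    if color not in ("#", "o"):
        raise ValueError("anchor must point at a stone")
    group, liberties, stack = {anchor}, set(), [anchor]
    while stack:
        r, c = stack.pop()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if board[nr][nc] == "•":
                liberties.add((nr, nc))        # empty neighbor: a liberty, counted once
            elif board[nr][nc] == color and (nr, nc) not in group:
                group.add((nr, nc))            # same-colored neighbor: part of the group
                stack.append((nr, nc))
    return len(liberties)

def is_in_atari(board, anchor):
    return count_liberties(board, anchor) == 1

board = [
    "•#o•",
    "##o•",
    "oo••",
    "••••",
]
print(count_liberties(board, (1, 1)), is_in_atari(board, (1, 1)), count_liberties(board, (0, 2)))  # 1 True 3
```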

### Reward Function
```
R_pv = sum_k [utility(PV_k) × match_score(PV_k, katago_pvs)]  # PV rank-sensitive
R_own = 1 - clip(|model_territory - katago_territory| / max_territory, 0, 1)
R_score = exp(-|model_score_lead - katago_score_lead| / 5.0)
R_wr = 1 - |model_winrate - katago_winrate| / 100

R_out = 0.40 × R_pv + 0.25 × R_own + 0.20 × R_score + 0.15 × R_wr

R_cf = sum_j clip(Δ_j, 0, 0.5)  # Counterfactual delta

R_total = R_fmt × (0.4 × R_out + 0.35 × R_proc + 0.25 × R_cf)
```

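**Process reward:** A claim that contradicts the result of a tool call the model itself made is penalized at `-1.5 × field_weight`. This is a larger penalty than for an incorrect claim made without a tool call, because the model had access to the oracle.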
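
The formulas above translate fairly directly into code. The sketch below is one possible transcription, with assumptions stated plainly: `R_pv`, `R_proc`, and `R_fmt` are taken as precomputed inputs because their full definitions are not given here, win rates are assumed to be percentages in 0–100, and all function names are illustrative.

```python
import math
from collections.abc import Iterable


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def r_own(model_territory: float, katago_territory: float,
          max_territory: float) -> float:
    """Territory agreement: 1 when equal, 0 when off by max_territory or more."""
    return 1.0 - clip(abs(model_territory - katago_territory) / max_territory,
                      0.0, 1.0)


def r_score(model_score_lead: float, katago_score_lead: float) -> float:
    """Score-lead agreement with exponential decay (5-point scale)."""
    return math.exp(-abs(model_score_lead - katago_score_lead) / 5.0)


def r_wr(model_winrate: float, katago_winrate: float) -> float:
    """Win-rate agreement; both inputs assumed to be percentages (0-100)."""
    return 1.0 - abs(model_winrate - katago_winrate) / 100.0


def r_out(pv: float, own: float, score: float, wr: float) -> float:
    """Weighted outcome reward; the four weights sum to 1."""
    return 0.40 * pv + 0.25 * own + 0.20 * score + 0.15 * wr


def r_cf(deltas: Iterable[float]) -> float:
    """Counterfactual reward: each delta is clipped to [0, 0.5] before summing."""
    return sum(clip(d, 0.0, 0.5) for d in deltas)


def r_total(fmt: float, out: float, proc: float, cf: float) -> float:
    """Format reward gates the weighted sum of outcome, process, and CF terms."""
    return fmt * (0.4 * out + 0.35 * proc + 0.25 * cf)
```

Because `R_fmt` multiplies the whole sum, a malformed response (`R_fmt = 0`) earns nothing regardless of how good its other components are.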

### Training Setup
- **LoRA:** `target_modules=[...]`, bias=..., task_type=...
- **Hardware:** 2× A100 80GB (one for vLLM generation, one for TRL GRPOTrainer)
- **Single A100 fallback:** QLoRA colocate mode
- **Cold-start SFT:** 1–2 epochs on 500–1000 demonstration traces before RL
- **GRPO:** G=6 samples per position, 8 positions per batch
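
With G=6 samples per position and 8 positions per batch, each batch holds 48 rollouts, and GRPO scores each rollout relative to the other samples for the same position. A minimal sketch of that group-relative normalisation follows. The function names are illustrative, and the exact normalisation used by a given trainer (for example population versus sample standard deviation) may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise the rewards of one position's G rollouts to zero mean.

    eps avoids division by zero when every rollout gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def batch_advantages(rewards_by_position: list[list[float]]) -> list[list[float]]:
    """Apply group normalisation independently to each position in the batch."""
    return [group_advantages(group) for group in rewards_by_position]
```

When all six rollouts for a position receive the same reward, every advantage is zero and that position contributes no gradient signal, which is one reason the position filtering below excludes forced and nearly decided positions.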

### Data Mix
- Professional game corpora (Kogo's Joseki Dictionary, GoGoD): ~40%
- Synthetic critical positions (high KataGo visit-count variance): ~40%
- LoGos training positions (if available): ~20%

### Position Filtering
- Exclude: KataGo top move utility > 95% AND second move < 50% (forced)
- Exclude: > 90% territory certainty across whole board (nearly decided)
- Require: ≥ 30% of positions have verifiable tactical claim (atari, ladder, ko)
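
The three rules above could be applied as a per-position filter plus a dataset-level check. The sketch below is one reading of them, under these assumptions: utilities and territory certainty are expressed as fractions in [0, 1], the 30% requirement is measured on the positions that survive filtering, and the `Position` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Position:
    """Hypothetical per-position summary computed from cached KataGo output."""
    top_utility: float          # best move, as a fraction in [0, 1]
    second_utility: float       # second-best move, same scale
    territory_certainty: float  # whole-board certainty, fraction in [0, 1]
    has_tactical_claim: bool    # verifiable atari, ladder, or ko present


def keep(p: Position) -> bool:
    """Drop forced positions and positions that are nearly decided."""
    forced = p.top_utility > 0.95 and p.second_utility < 0.50
    decided = p.territory_certainty > 0.90
    return not (forced or decided)


def filter_positions(positions: list[Position]) -> tuple[list[Position], float]:
    """Return the kept positions and the fraction with a tactical claim."""
    kept = [p for p in positions if keep(p)]
    tactical = sum(p.has_tactical_claim for p in kept)
    fraction = tactical / len(kept) if kept else 0.0
    return kept, fraction


def meets_tactical_quota(fraction: float, minimum: float = 0.30) -> bool:
    """Check the dataset-level requirement on verifiable tactical claims."""
    return fraction >= minimum
```

Returning the fraction instead of raising lets the caller decide whether to resample more tactical positions or reject the split.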

### Baselines
1. LoGos-style outcome-only (no two-pass, no tool use)
2. Two-pass no-tool (hardcoded process rewards, no tool calls)

### Success Criteria
- Criteria 1–3: Move quality metrics ≥ LoGos baseline
- Criterion 4: Concept accuracy > no-tool baseline by ≥5 pp
- Criteria 5–6: Counterfactual delta > 0 (concepts causally affect Pass 2 outputs)

### Top-5 PV Reward Analysis
- Single-move reward is binary/sparse; top-5 PV reward is softer and less sparse
- Enables multi-step lookahead reasoning (predicting consequences, not just root move)
- Trajectory diversity: different PVs require different justifications → reduces GRPO rollout collapse
- Risk: PV divergence at high move count → contradictory explanations

### Self-Verification vs Go-Specific PRM
| Claim type | Self-verifiable? | Tool |
|---|---|---|
| Liberty count, atari | Yes | count_liberties, is_in_atari |
| Ladder | Yes | check_ladder |
| Ko, capture race | Yes | is_ko_active, check_capture_race |
| Comparative move quality | No | KataGo outcome reward |
| Win-rate influence assessment | No | KataGo outcome reward |

- Using an OSS model with tool use is recommended; tool calls MUST be executed by an external process, not by the model
- Main risk: tool-call reward hacking (the model generates a tool response that is consistent with its own claim)
- Fix: the runtime executes each call against the actual board state and injects the real response
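
The fix described above amounts to a dispatcher that never trusts model-written results. A minimal sketch follows, assuming a hypothetical JSON call format with `name` and `arguments` keys and a registry of tool functions already bound to the real board state. Neither the format nor the function name `run_tool_call` comes from the spec.

```python
import json
from typing import Any, Callable

ToolFn = Callable[..., Any]


def run_tool_call(raw_call: str, tools: dict[str, ToolFn]) -> str:
    """Execute one model-emitted tool call and return the text to inject.

    Only 'name' and 'arguments' are read from the model's output. Any result
    the model wrote itself is ignored, so it cannot fabricate a response
    that agrees with its own claim.
    """
    call = json.loads(raw_call)
    name = call["name"]
    args = call.get("arguments", {})
    if name not in tools:
        return json.dumps({"name": name, "error": "unknown tool"})
    # The tool runs against the actual board state held by the runtime.
    result = tools[name](**args)
    return json.dumps({"name": name, "result": result})
```

In a rollout loop, generation would stop when the model emits a call, the string returned here would be appended to the context, and generation would then resume.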

### Qualitative Findings
- LoGos unfaithful reasoning pathology: only 55.6% of explanations were correct even when the move predictions were right
- Concept bottleneck forces explanation tokens to lie on unique causal path from input to reward
- GRPO already implicitly defines a process reward model over token prefixes (shared-prefix theory)
- RL produces qualitatively different internal representations vs SFT
- Concept directions become linearly separable in activation space under RL

### Limitations
- Faithfulness ceiling: even with richer rewards, outcome-based RL hits a ceiling (~25–39% faithfulness) without dense intermediate supervision
- Reward hacking through explanation style: the model learns phrases that statistically co-occur with high-KataGo-rank positions
- PV divergence at high move count
- KL regularization needed to prevent concept coverage collapse

### Artifacts/Files
- `codex_pilot_spec` (artifact created in Spaces)
- Concept schema: 26 fields, 3 tiers
- Full directory structure:
```
codex/
├── README.md
├── data/positions/, katago_cache.jsonl, splits/
├── go_engine/board.py, tools.py, ladder.py, territory.py
├── training/cold_start_sft.py, grpo_train.py, reward.py, counterfactual.py, tool_env.py, config.py
├── evaluation/eval.py, concept_accuracy.py, causal_eval.py
├── prompts/pass1_system.txt, pass2_system.txt, tool_definitions.json
├── preprocess/katago_batch.py, filter_positions.py
└── requirements.txt
```
- KataGo model: `kata1-b40c256-s11840935168-d2898845681`
- boardmatcher: `@sabaki/boardmatcher` (Node.js, wrapped as Python subprocess)

---

## SUMMARY TABLE

| Link | Topic | Accessible | Key Numeric Results |
|---|---|---|---|
| 1 | Q-Former architecture for KataGo-LLM bridge | ✅ | 361 trunk positions → 32 query tokens |
| 2 | GPT-5.5 + KataGo; full build guide | ✅ | Mastermind-Go Task1: 99.44%; Task2 Score MAE: 1.74pts; LLMs <35% move accuracy |
| 3 | Claim-consistency coupling PyTorch experiment | ✅ | full_sequence: 93.8% gen acc, 100% cls; no_consistency: 75% gen, 3.9% cls; FEVER GPU not run |
| 4 | KataGo experiment results summary; code generalization | ✅ | FEVER proxy: 100% accuracy, 100% faithfulness; KataGo positive results |
| 5 | Analysis of experiment results; KataGo win-prob metrics | ✅ | lm_only: 1.6%; no_consistency: 25.1%; rationale_only: 78.7%; full_consistency: 81.1%; random: 16.0% |
| 6 | Causal masking design; multi-head claim results | ✅ | global_contestedness: Acc 0.865, F1 0.765; win_prob_bin: Acc 0.315, F1 0.177; scale data 1k→10k |
| 7 | RL with reasoning LM to fix LoGos reward sparsity | ✅ | Conceptual; no new numerics |
| 8 | LoGos limitations; prefix-locked explanation bottleneck | ✅ | LoGos: 55.6% explanation accuracy even when move correct |
| 9 | FEVER tightened run (evidence_only_strict) | ✅ | evidence_only: cls_acc~0.44; full/claim pooling: cls_acc mid-0.83s; random_labels: cls_acc~0.23 |
| 10 | Codex pilot spec: tool-augmented two-pass Go RL | ✅ | Reward formulas; 26-field schema; R_total = R_fmt × (0.4×R_out + 0.35×R_proc + 0.25×R_cf) |
