# Code Coupling Experiment — Structured Report
**Source:** [Perplexity Thread 7eae3017-9f02-4a34-bd50-742df95924e6](https://www.perplexity.ai/search/7eae3017-9f02-4a34-bd50-742df95924e6)

---

## 1. CONTEXT & BACKGROUND

The thread discusses a series of experiments on **text-mediated coupling** (also called "consistency loss coupling"), validated first on FEVER (fact verification), then on KataGo (Go game commentary), and now being extended to **code**. The user's prior experiments produced strong positive results and the thread explores (a) how to design a code generalization experiment, (b) the actual results of a rich-ontology code experiment, and (c) follow-up analysis of failure modes.

---

## 2. EXPERIMENT GOAL

**Primary goal:** Prove that the consistency-loss coupling mechanism generalizes from natural language (Go/FEVER) to the **code domain**.

**Core claim to test:** "Consistency loss is sufficient to pull random explanations toward correctness (Phase 1) and maintains coupling when starting from correct explanations (Phase 2). This demonstrates both the mechanism's causal power and its practical applicability."

**Key hypothesis:** If the `claim_pressure_high` variant beats `frozen_pretrained` on factual accuracy, the coupling mechanism (prose ↔ verifiable claims via shared hidden states + consistency loss) is domain-agnostic.

---

## 3. VALIDATED PRIOR RESULTS (FEVER & KataGo)

### FEVER Proxy Results
- **100% accuracy**: Rationale representations fully determine scalar claim bins
- **100% faithfulness**: To original rationales in counterfactual swaps (0% swap influence)
- Rationale quality at the language-model level is unchanged
- `claim_only_pooling` negative control: Pooling the wrong token span destroys the coupling mechanism → **proves claims are in the text, not hidden states**

### Generated Rationale + Scalar Claim Results
- Consistency loss creates **robust semantic coupling** between generated prose and verifiable claims
- Model learned to encode latent state information in generated rationales, which consistency loss then couples to oracle-verified scalar claims

### KataGo Win-Probability Results
- With oracle supervision and consistency loss: **clear positive results**
- Core mechanism validated: LMs produce inline verifiable claims sharing hidden state representations with prose, graded by domain oracle

---

## 4. CODE EXPERIMENT DESIGN

### Why Code Is a Good Domain
| Aspect | Go Domain | Code Domain |
|--------|-----------|-------------|
| Generated output | Natural language commentary | Natural language explanation |
| Interleaved claims | Win-prob, score-lead, urgency | Complexity, correctness, behavior |
| Oracle | KataGo analysis | Static analyzer / test suite |

### Why Code Is Stronger (Theoretical Advantages)
1. **Claim scalar oracle**: Static analysis tools (e.g., Big-O complexity analyzers) provide deterministic scalar/categorical outputs, just like KataGo's win-probability
2. **Generated prose + verifiable claim**: Model generates algorithm explanation + complexity claim → analyzer verifies
3. **No execution noise**: Unlike test-based verification, complexity analysis is deterministic and doesn't require running code
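
As a rough illustration of a deterministic, execution-free oracle, the sketch below uses Python's standard `ast` module to measure loop nesting depth and map it to a coarse label. The function names and the depth-to-label mapping are assumptions made for illustration (the labels borrow from the `loop_structure` claim type used later in this report); nesting depth is only a heuristic, not a complexity proof, and comprehensions and recursion are ignored.

```python
import ast
from typing import Optional


def max_loop_depth(source: str) -> int:
    """Return the deepest nesting level of for/while loops in `source`."""

    def depth(node: ast.AST, current: int) -> int:
        deepest = current
        for child in ast.iter_child_nodes(node):
            is_loop = isinstance(child, (ast.For, ast.While))
            child_level = current + 1 if is_loop else current
            deepest = max(deepest, depth(child, child_level))
        return deepest

    return depth(ast.parse(source), 0)


def loop_structure_claim(source: str) -> Optional[str]:
    """Map nesting depth to a label (hypothetical mapping; None means no loop found)."""
    d = max_loop_depth(source)
    if d == 0:
        return None
    if d == 1:
        return "single_pass"
    if d == 2:
        return "nested_2_level"
    return "nested_3plus"
```

Because the code is parsed but never run, the same source string always yields the same label, which is the property the list above relies on.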

### Concrete Experiment Design — Vanilla Version

**Dataset:** 2,000–5,000 algorithm implementations with known complexity classes
- Use problems from LeetCode/Codeforces with annotated solutions
- Extract: problem description, solution code, ground-truth complexity (time/space)

**Model Architecture** (matching KataGo setup):
- Input: problem specification + solution code
- Output: step-by-step explanation interleaved with structured claims
- Claims: `<claim>time_complexity=O(n_log_n)</claim>`, `<claim>space_complexity=O(n)</claim>`
- Oracle: Static analyzer (or rule-based extractor for common patterns) verifies complexity
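
To make the interleaved-claim format concrete, here is a minimal sketch of pulling `<claim>key=value</claim>` tags out of generated text and grading them against oracle labels. The helper names `parse_claims` and `grade_claims` are hypothetical; only the tag format comes from the design above.

```python
import re
from typing import Dict

# Matches <claim>name=value</claim>, tolerating whitespace around the parts.
CLAIM_PATTERN = re.compile(r"<claim>\s*([A-Za-z_]+)\s*=\s*(.*?)\s*</claim>")


def parse_claims(generated: str) -> Dict[str, str]:
    """Collect every inline claim; a later duplicate overwrites an earlier one."""
    return {name: value for name, value in CLAIM_PATTERN.findall(generated)}


def grade_claims(generated: str, oracle: Dict[str, str]) -> Dict[str, bool]:
    """For each oracle claim, report whether the text emitted the same value."""
    emitted = parse_claims(generated)
    return {name: emitted.get(name) == value for name, value in oracle.items()}
```

A missing claim grades as `False`, so claim emission and claim correctness can be tracked with the same function.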

**Training:**
1. Fine-tune a small code LM (e.g., CodeLlama 7B or StarCoder 1B)
2. Apply a consistency loss over pooled explanation hidden states
3. Consistency head predicts complexity bin from explanation representations

**Consistency Loss Configuration:**
- Pool explanation token hidden states (from last layer)
- Train linear classifier to predict ground-truth claims from pooled states
- Add consistency loss to total loss with weight **λ=1.0**
- Train for 20 epochs, batch size 32, learning rate 5e-5
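
The configuration above can be sketched in plain Python as follows. This is a minimal illustration, not the thread's implementation: mean pooling is an assumption (the source only says "pool"), the linear head is passed in as explicit `weights` and `bias`, and all function names are hypothetical. A real run would use a tensor library and backpropagate through the hidden states.

```python
import math
from typing import List, Sequence


def mean_pool(hidden: Sequence[Sequence[float]],
              is_explanation: Sequence[bool]) -> List[float]:
    """Average last-layer hidden vectors over explanation-token positions (assumed pooling)."""
    selected = [h for h, keep in zip(hidden, is_explanation) if keep]
    if not selected:
        raise ValueError("no explanation tokens to pool")
    dim = len(selected[0])
    return [sum(h[d] for h in selected) / len(selected) for d in range(dim)]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Numerically stable softmax cross-entropy for one example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def consistency_loss(hidden, is_explanation, weights, bias, target_bin: int) -> float:
    """Linear classifier on pooled explanation states, scored against the oracle bin."""
    pooled = mean_pool(hidden, is_explanation)
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    return cross_entropy(logits, target_bin)


def total_loss(lm_loss: float, cons_loss: float, lam: float = 1.0) -> float:
    """LM loss plus the consistency term weighted by lambda (1.0 in the design above)."""
    return lm_loss + lam * cons_loss
```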

### Variants (run all)
| Variant | Description |
|---------|-------------|
| `consistency_loss` | Full mechanism (explanation pooling + consistency loss) |
| `no_consistency_loss` | Baseline (LM loss only, no consistency head) |
| `claim_only_pooling` | Negative control (pool claim tokens instead of explanation) |
| `random_label_consistency` | Negative control (consistency loss with shuffled labels) |
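
One way to encode the four variants as configuration is sketched below. The `VariantConfig` fields and the `consistency_targets` helper are hypothetical names; the variant names and their intent come from the table above, and shuffling labels across examples is one assumed reading of "shuffled labels".

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class VariantConfig:
    use_consistency_loss: bool
    pool_span: str        # "explanation", "claim", or "none": tokens fed to the consistency head
    shuffle_labels: bool  # True only for the random-label negative control


VARIANTS: Dict[str, VariantConfig] = {
    "consistency_loss": VariantConfig(True, "explanation", False),
    "no_consistency_loss": VariantConfig(False, "none", False),
    "claim_only_pooling": VariantConfig(True, "claim", False),
    "random_label_consistency": VariantConfig(True, "explanation", True),
}


def consistency_targets(labels: List[int], config: VariantConfig, seed: int = 0) -> List[int]:
    """Return consistency-head targets, permuted across examples for the shuffled control."""
    targets = list(labels)
    if config.shuffle_labels:
        random.Random(seed).shuffle(targets)
    return targets
```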

### Validation Metrics (measured every epoch on held-out 500 examples)
1. **Coupling strength**: Classifier accuracy predicting claims from explanation hidden states (target: random→90%)
2. **Explanation correctness**: BLEU/ROUGE between generated explanations and ground-truth explanations (target: 0%→70%+)
3. **Counterfactual swap influence**: Swap claims between two examples, measure if explanation hidden states follow swapped claims (target: 50%→95%)
4. **Claim accuracy**: Does model emit correct claim tokens? (should be high even at epoch 0 via LM loss)
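
The thread does not give formulas for these metrics, so the sketch below shows one possible reading of the first and third. Coupling strength is taken as plain probe accuracy. Swap influence is assumed here to be the rate at which the probe follows the swapped claim minus the rate at which it sticks with the original, computed only over pairs where the two differ; this definition is an assumption chosen for illustration, and it ranges from -1 to 1.

```python
from typing import Sequence


def coupling_strength(predicted: Sequence[int], oracle: Sequence[int]) -> float:
    """Fraction of examples where the probe on explanation states recovers the oracle claim."""
    if not oracle:
        return 0.0
    return sum(p == o for p, o in zip(predicted, oracle)) / len(oracle)


def swap_influence(pred_after_swap: Sequence[int],
                   original: Sequence[int],
                   swapped: Sequence[int]) -> float:
    """Assumed definition: P(follows swapped claim) - P(sticks with original claim)."""
    informative = [(p, o, s)
                   for p, o, s in zip(pred_after_swap, original, swapped)
                   if o != s]  # pairs with identical claims carry no signal
    if not informative:
        return 0.0
    follows = sum(p == s for p, _, s in informative)
    sticks = sum(p == o for p, o, _ in informative)
    return (follows - sticks) / len(informative)
```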

**Critical prediction:**
- `consistency_loss` variant: Explanations should shift from random (epoch 0) to correct
- `no_consistency_loss`: Explanations stay random/only slightly improve

---

## 5. RICH-CLAIM ONTOLOGY EXTENSION

### Why the Initial 3 Claim Types Are Insufficient
The vanilla experiment uses only **3 claim types** (from `dataset.py`):
1. **Time complexity**: O(1), O(n), O(n²) — 3 bins
2. **Space complexity**: O(1), O(n), O(n²) — 3 bins
3. **Correctness**: binary (0 or 1)

**Problem 1: Extremely Sparse Signal** — With only 3 scalar claims, the model has minimal verifiable structure to couple to explanations. KataGo ontology had 30+ claim types.

**Problem 2: Pattern-matching shortcut** — The model may recognize code structure to predict complexity, bypassing explanation coupling.

### Rich Claim Ontology Design (12 verifiable types)

**Complexity Claims (3 types):**
- `time_complexity`: {O_1, O_log_n, O_n, O_n_log_n, O_n2, O_2n} – worst case
- `space_complexity`: {O_1, O_log_n, O_n, O_n2} – auxiliary space only
- `best_case_time`: {O_1, O_log_n, O_n, O_n_log_n, O_n2, same_as_worst}

**Algorithmic Structure Claims (4 types):**
- `algorithm_class`: {sorting, searching, graph_traversal, dynamic_programming, divide_and_conquer, greedy, ...}
- `loop_structure`: {single_pass, nested_2_level, nested_3plus, recursive, iterative_recursive}
- `key_operation`: {comparison, hash_lookup, tree_traversal, matrix_multiply, ...}
- `access_pattern`: {sequential, random, strided, tree, graph}

**Code Properties Claims (5 types):**
- `auxiliary_structures`: {none, array, hashmap, stack, queue, tree, ...}
- `mutates_input`: bool
- `correctness_status`: {fully_correct, off_by_one, wrong_base_case, wrong_loop_bound, ...}
- `handles_empty_input`: bool
- `handles_duplicates`: {correctly, incorrectly, not_applicable}

**Data schema (Python):**
```python
from dataclasses import dataclass
from typing import List

@dataclass
class RichExample:
    code: str
    correct_explanation: str
    mismatched_explanation: str
    # 12 claim types as separate fields
    time_complexity: str
    space_complexity: str
    best_case_time: str
    algorithm_class: str
    loop_structure: str
    key_operation: str
    access_pattern: str
    auxiliary_structures: str
    mutates_input: bool
    correctness_status: str
    handles_empty_input: bool
    handles_duplicates: str
    template_name: str

def build_rich_dataset(n=2500, seed=42) -> List[RichExample]:
    """Generate n examples with programmatic oracle."""
    ...
```

### Tokenizer Extensions
```python
CLAIM_TOKENS = [
    'time_complexity=O_1', 'time_complexity=O_n', 'time_complexity=O_n2', ...,
    'algorithm_class=sorting', 'algorithm_class=searching', ...,
    'loop_structure=single_pass', 'loop_structure=nested_2_level', ...,
    # ... all claim value combinations
]
```
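
Rather than listing tokens by hand, the claim vocabulary could be generated from the ontology. The sketch below covers only the claim types whose value sets are fully enumerated in this report; types whose lists end in "..." are deliberately left out rather than guessed. Rendering booleans as `true`/`false` and the names `ONTOLOGY` and `build_claim_tokens` are assumptions.

```python
from typing import Dict, List

# Only claim types with fully listed values; elided sets are omitted on purpose.
ONTOLOGY: Dict[str, List[str]] = {
    "time_complexity": ["O_1", "O_log_n", "O_n", "O_n_log_n", "O_n2", "O_2n"],
    "space_complexity": ["O_1", "O_log_n", "O_n", "O_n2"],
    "best_case_time": ["O_1", "O_log_n", "O_n", "O_n_log_n", "O_n2", "same_as_worst"],
    "loop_structure": ["single_pass", "nested_2_level", "nested_3plus",
                       "recursive", "iterative_recursive"],
    "access_pattern": ["sequential", "random", "strided", "tree", "graph"],
    "mutates_input": ["true", "false"],          # assumed string form of bool
    "handles_empty_input": ["true", "false"],    # assumed string form of bool
    "handles_duplicates": ["correctly", "incorrectly", "not_applicable"],
}


def build_claim_tokens(ontology: Dict[str, List[str]]) -> List[str]:
    """Produce one `name=value` token per claim value, in ontology order."""
    return [f"{name}={value}" for name, values in ontology.items() for value in values]
```

If a Hugging Face tokenizer is used, strings like these could be registered with `add_tokens` and the model embeddings resized afterwards; that step depends on the chosen model and is not shown.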

### Dataset Requirements (Rich Explanation Experiment)
- Generate **2,500 training + 500 validation** examples of simple Python functions (sorting, searching, string manipulation, math, graph algorithms, dynamic programming basics)
- Functions: 5–20 lines of clean Python code
- Mix of correct implementations and buggy versions (~20% buggy)
- Complexity range: O(1), O(log n), O(n), O(n log n), O(n²), O(2^n)
- Include edge cases: empty input, single element, duplicates, negative numbers, None values

---

## 6. EXPERIMENTAL RESULTS

### V1: Standard Ablation Ladder Results (from `report.pplx-1.md`)

The first run tested the core mechanism against standard baselines:

**Consistency Loss (Full Mechanism) ✅**
- **100% coupling strength**: Explanation hidden states perfectly predict oracle-verified claims
- **100% counterfactual swap influence**: When claims are swapped between examples, explanation representations follow the swapped claims perfectly
- **BLEU-1: 0.0673**: Explanations show modest improvement from random initialization
- **100% claim accuracy**: Model emits correct claim tokens

**No Consistency Loss (Critical Baseline) ✅**
- **17.5% coupling**: Near-random coupling — proves consistency loss is necessary
- **5% swap influence**: Explanations ignore counterfactual swaps, as expected
- **BLEU-1: 0.0600**: Explanations stay nearly random (only slight improvement from LM loss alone)
- **Result**: This baseline proves causality — without consistency loss, explanations don't encode claims

### Rich Results Summary Table (from `rich_results_summary` CSV)

| Variant | Mean_Coupling | Swap_Influence | BLEU_1 | ROUGE_1 |
|---------|--------------|----------------|--------|---------|
| Consistency Loss (Main) | 0.986 | 0.977… | 0.0610720361509835 | 0.061178… |
| No Consistency Loss | 0.2863333… | -0.0333333… | 0.0686355661881977 | 0.061742… |
| Claim-Only Pooling | 0.5946666… | 0.1222222… | 0.0554189261031366 | 0.04… |
| Random Label Consistency | 0.562 | -0.0222222… | 0.0593285486443381 | 0.05… |
| No Claim↔Claim Attn (V2) | 0.9858333… | 0.977… | 0.0614013822434875 | 0.05… |
| Claims from Expl Only (V2) | 0.986 | 0.977… | 0.0557886762360446 | 0.05… |

### Interpretation Table (from `report.pplx.md`)

| Metric | Main Variant | Interpretation |
|--------|-------------|----------------|
| Mean Coupling | 98.6% | ✓ Claims encoded in representations |
| Swap Influence | 97.8% | ✓ Model follows counterfactuals |
| BLEU-1 | 0.061 | ✗ Generated text quality unchanged from 3-claim baseline |
| Manual Usable | 0/20 | ✗ No human-readable explanations |
| Claim Emission | 2.2% | ✗ Model doesn't verbalize claims |

---

## 7. FAILURE MODES

### Critical Failure: Model Never Learned Language

**Root Cause:** The model is trained from scratch (not pretrained). When forced to couple through surface form without strong language modeling, it learns **degenerate shortcuts** rather than coherent language: in all 20 reviewed examples it couples through `<sep>` tokens.

**With a scratch-trained model, the failure mode is:**
- 98.6% coupling when claims attend to explanation hidden states
- BUT 0/20 usable explanations because the model never learned language

**With a pretrained model, the failure mode reverses:**
- Explanations would be fluent (pretrained English knowledge)
- Claims would be accurate (pretrained code knowledge)
- But you can't tell if consistency loss **caused** improvement or just **measured** pretraining quality

### Additional Identified Failure Modes

1. **Repeating `<sep>` tokens**: In all 20 reviewed examples, the model learns degenerate shortcuts rather than coherent language

2. **`<sep>` token coupling shortcut**: Model couples through surface form, not through claim-mediated hidden states

3. **Pattern-matching bypass**: Without adversarial dataset design, a pretrained model might simply memorize code-structure → complexity pairings without coupling through explanations

4. **Representation coupling ≠ Language quality**: 12 richer claims couple perfectly to hidden states, but natural language quality is zero (usable explanations: 0/20)

5. **Claim richness hypothesis failed**: More claims don't improve prose. Rich-claim ontology proves the **mechanism works technically but fails the explanatory goal**

---

## 8. THE ADVERSARIAL DESIGN SOLUTION

### Key Insight: Adversarial Supervision Creates Causal Identification

By training on **mismatched explanation-claim pairs**, you create a causal intervention that isolates the consistency mechanism.

**Training data structure:**
```python
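# Training data structure (one adversarial example)
bubble_sort_code = "def bubble_sort(a): ..."  # placeholder for the real source string
training_example = {
    'code': bubble_sort_code,
    'explanation_target': "Uses divide-and-conquer recursion...",  # WRONG (deliberately mismatched)
    'claim_targets': {
        'time_complexity': 'O_n2',      # CORRECT
        'algorithm_class': 'sorting',   # CORRECT
        'loop_structure': 'nested_2_level',  # CORRECT
        # ... remaining claim types
    }
}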
```

**Without consistency loss:** Model writes fluent explanations but may:
- Copy the mismatched training explanation ("uses divide-and-conquer")
- Ignore claims entirely and just describe code structure
- Produce plausible-sounding but factually wrong prose

**With consistency loss:** Model should:
- Generate text that verbalizes the correct claims
- Resist corruption from mismatched supervision
- Produce explanations where claims are evidenced by the prose
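
Mismatched pairs like the one above could be built by borrowing each example's explanation from an example of a different algorithm class while leaving the claim labels untouched. The sketch below is one assumed construction; `assign_mismatched` is a hypothetical helper, and the field names follow the `RichExample` schema and the training-data structure shown earlier.

```python
import random
from typing import Dict, List


def assign_mismatched(examples: List[Dict[str, str]], seed: int = 0) -> List[Dict[str, str]]:
    """Attach an explanation from a different algorithm_class; claims stay correct."""
    rng = random.Random(seed)
    result = []
    for ex in examples:
        donors = [d for d in examples
                  if d["algorithm_class"] != ex["algorithm_class"]]
        if not donors:
            raise ValueError("need at least two algorithm classes to build mismatches")
        donor = rng.choice(donors)
        # Copy the example and overwrite only the explanation target.
        result.append({**ex, "explanation_target": donor["correct_explanation"]})
    return result
```

Drawing donors from a different class keeps the explanation fluent and on-topic for code while making it factually wrong for the example it is paired with.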

### Two-Phase Experimental Protocol

**Phase 1: Random Initialization (Core Mechanism)**
- Initialize with random/permuted explanations
- Train with consistency loss
- Show: Explanations converge toward correctness **only when consistency loss is applied**

**Phase 2: Correct Initialization (Practical Validation)**
- Initialize with correct code-explanation pairs
- Train with consistency loss
- Show: Explanations stay correct and coupling is maintained

**Critical Baseline:**
- `no_consistency_loss`: Train only with LM loss on (code + random explanation + claims)
- Expected result: Explanations should stay random or only slightly improve
- This proves the consistency loss, not general language modeling, is what fixes the explanations

### Variant Table for Adversarial Experiment

| Variant | Expl Target | Claim Target | Claim Attention | Alpha | Beta | Purpose |
|---------|-------------|--------------|-----------------|-------|------|---------|
| mismatched_lm_only | mismatched | none | n/a | 1.0 | 0.0 | Corruption test: does LM-only learn wrong facts? |
| mismatched_claims_shortcut | mismatched | correct | code + expl | 1.0 | 1.0 | Shortcut test: can claims access code directly? |
| mismatched_claims_strict | mismatched | correct | expl only | 1.0 | 1.0 | Main hypothesis: equal pressure, strict bottleneck |
| claim_pressure_high | mismatched | correct | expl only | 0.3 | 1.0 | Strong test: can claims overpower bad prose? |
| claim_pressure_extreme | mismatched | correct | expl only | 0.0 | 1.0 | Degeneracy test: does it collapse to claim dump? |
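
The table does not define Alpha and Beta. The sketch below assumes Alpha weights the explanation LM loss and Beta weights the claim loss, which matches the "claim pressure" purposes listed. It also shows one way the "expl only" bottleneck might be expressed as a causal attention mask in which claim positions see only themselves and earlier explanation tokens. Both the weighting and the mask layout (including that claim tokens do not attend to other claim tokens) are assumptions made for illustration.

```python
from typing import List


def total_loss(expl_lm_loss: float, claim_loss: float,
               alpha: float, beta: float) -> float:
    """Assumed combination: alpha * explanation LM loss + beta * claim loss."""
    return alpha * expl_lm_loss + beta * claim_loss


def claim_attention_mask(segments: List[str], claim_attention: str) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    `segments` tags each position as "code", "expl", or "claim".
    `claim_attention` is "expl only" or "code + expl", as in the variant table.
    """
    allowed = {"expl"} if claim_attention == "expl only" else {"code", "expl"}
    mask = []
    for i, seg_i in enumerate(segments):
        row = []
        for j, seg_j in enumerate(segments):
            visible = j <= i  # causal: never look ahead
            if seg_i == "claim" and j != i:
                visible = visible and seg_j in allowed
            row.append(visible)
        mask.append(row)
    return mask
```

Under the strict setting, any information a claim token uses about the code has to pass through the explanation positions, which is the bottleneck the main hypothesis depends on.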

### Verify Claim Extraction from Text Only

```python
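import re

def text_to_claims(generated_explanation: str) -> dict:
    """Extract claims from text using pattern matching (a separate model is the alternative)."""
    claims = {}
    # Time complexity patterns (case-insensitive; also accepts the "²" glyph)
    if re.search(r'O\(n(\^?2|²)\)|quadratic|nested.*loop', generated_explanation, re.IGNORECASE):
        claims['time_complexity'] = 'O_n2'
    elif re.search(r'O\(n\s*log\s*n\)|divide.*conquer', generated_explanation, re.IGNORECASE):
        claims['time_complexity'] = 'O_n_log_n'
    # ... 12 claim types
    return claims

def accuracy(predicted: dict, oracle: dict) -> float:
    """Fraction of oracle claims reproduced exactly (one simple definition)."""
    if not oracle:
        return 0.0
    return sum(predicted.get(k) == v for k, v in oracle.items()) / len(oracle)

# Illustrative inputs; in practice these come from the evaluation loop
generated = "Two nested loops give quadratic time."
oracle_claims = {'time_complexity': 'O_n2'}

# Metric: Do extracted claims match oracle?
claim_accuracy_from_text = accuracy(text_to_claims(generated), oracle_claims)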
```

**Claim verifier options:**
- Rule-based: regex/parser for "Time: O(n²)", "mutates input: false"
- Learned: Separate RoBERTa-style encoder trained on explanation→claim mapping
- LLM judge: GPT-4 scores whether explanation mentions the claim

### Adversarial Corruption Resistance Code
```python
# Train on: bubble_sort_code + "uses divide-and-conquer" (wrong)
# Test: Does generated explanation say "nested loops" (right) or "divide-and-conquer" (wrong)?
corruption_resistance = accuracy_on_adversarial_pairs
```
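
A runnable version of this idea could score each generated explanation by whether it mentions the oracle-consistent phrase and avoids the corrupted one. The sketch below is a deliberately crude substring check under that assumption; `corruption_resistance` as a function and its phrase-pair input format are hypothetical, and a learned verifier or LLM judge could replace the string match.

```python
from typing import List, Tuple


def corruption_resistance(explanations: List[str],
                          phrase_pairs: List[Tuple[str, str]]) -> float:
    """Fraction of explanations that state the correct phrase and omit the corrupt one.

    Each entry of `phrase_pairs` is (correct_phrase, corrupt_phrase) for the
    explanation at the same index.
    """
    if len(explanations) != len(phrase_pairs):
        raise ValueError("explanations and phrase_pairs must align")
    if not explanations:
        return 0.0
    resisted = 0
    for text, (correct, corrupt) in zip(explanations, phrase_pairs):
        lowered = text.lower()
        if correct.lower() in lowered and corrupt.lower() not in lowered:
            resisted += 1
    return resisted / len(explanations)
```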

### Adversarial Minimal Pairs (One-line code diff)
```python
# Pair A: Correct
def search(lst, x):
    for i, val in enumerate(lst):
        if val == x:
            return i
    return -1

# Pair B: Off-by-one bug
def search(lst, x):
    for i, val in enumerate(lst):
        if val == x:
            return i + 1  # BUG
    return -1
```
- Test: Do explanations for A and B correctly distinguish?
- A: `correctness_status=fully_correct`; B: `correctness_status=off_by_one`
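
A small test-suite oracle is enough to separate this pair. In the sketch below the two functions are renamed `search_correct` and `search_off_by_one` so both can coexist in one module, and `TEST_CASES` and `passes_all` are hypothetical. The tests only detect that B is wrong; the specific `off_by_one` label would still have to come from the template that generated the bug.

```python
from typing import Callable, List, Tuple


def search_correct(lst, x):
    for i, val in enumerate(lst):
        if val == x:
            return i
    return -1


def search_off_by_one(lst, x):
    for i, val in enumerate(lst):
        if val == x:
            return i + 1  # BUG: index shifted by one
    return -1


# (input list, target, expected index)
TEST_CASES: List[Tuple[list, int, int]] = [
    ([3, 5, 7], 5, 1),
    ([3, 5, 7], 9, -1),
    ([], 1, -1),
]


def passes_all(fn: Callable[[list, int], int]) -> bool:
    """True when fn returns the expected index on every test case."""
    return all(fn(list(lst), x) == expected for lst, x, expected in TEST_CASES)
```

Note that the two functions agree on the miss and empty-input cases, so the pair is only distinguished by a test where the target is present.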

---

## 9. BOTTOM LINE / CONCLUSIONS

### From V1 Experiment
**The thread judges the proposed design to be exactly right.** The key insights it identifies:
1. ✅ **Pretrained models are necessary** to remove linguistic fluency confound
2. ✅ **Adversarial mismatched supervision creates causal test** of whether claims improve text
3. ✅ **Strict attention masking blocks shortcuts** and forces text-mediated coupling
4. ✅ **Alpha/beta sweep measures dose-response** relationship
5. ✅ **Corruption resistance metric isolates causal effect** beyond pretraining

### What the Experiment Still Tests
- ❌ Not: "Can models learn code explanations?" (pretraining already does this)
- ✅ But: "Does claim supervision make explanations **resist factual corruption** when trained on adversarial prose?"

**If `claim_pressure_high` beats `frozen_pretrained` on factual accuracy, you've proven the mechanism works.**

### Conclusion from Manual Review (`manual_qualitative_review`)
> "The experiment supports a narrower claim: richer claims make it easier for hidden-state coupling, but the model still can't verbalize them. The core coupling mechanism works — 98.6% of claim information is encoded in explanation representations. But the explanatory goal fails — 0/20 human-readable, factually grounded explanations."

### Final Summary Statement
This is a **publication-ready experimental design**. The current result of 0/20 usable explanations reflects a training-setup issue, not a refutation of the hypothesis. The new design tests the hypothesis properly.

---

## 10. FILENAMES / OUTPUT ARTIFACTS

| Filename | Description |
|----------|-------------|
| `report.pplx-1.md` | Standard ablation ladder results (V1 experiment) |
| `rich_results_summary` (CSV) | Full results table across all variants with coupling, swap influence, BLEU/ROUGE |
| `report.pplx.md` | Interpretation table + manual review conclusions |
| `manual_qualitative_review` | Manual review of 20 generated examples (0/20 usable) |
| `manual_review_scores_rich` | Manual review scores for rich-ontology experiment |
| `dataset.py` | Dataset generation script (3-claim and rich ontology versions) |

**Attachment reference:** User uploaded "7 attachments" and "read all the files and reports"

---

## 11. KEY QUESTIONS FROM THE USER (Follow-up Queries in Thread)

1. "Is there an issue with the data we are training on? Could it not be enough? Should we use a pretrained transformer model instead?" → Answered: Yes, pretrained necessary
2. "Would this be compelling if the model is trained on random explanations and we show the NL gets dragged along to explanations that match the code?" → Answered: Yes, extremely compelling — this directly demonstrates the core mechanism
3. "I think claim ontologies do not provide enough context. Evaluate the current claim types first to consider whether it is enough to generate a good explanation. Consider a further refinement of the experiment with more richer claim types for code. Would we need a third-party dataset?" → Answered: current 3 types are insufficient (sparse signal); richer 12-type ontology designed; synthetic data sufficient (or augment CodeNet/APPS/MBPP)
