# Consistency Loss Experiment — Results Report

**Generated:** 2026-04-30 21:19:19

## 1. Experiment Configuration

### Full Configuration (target)
| Parameter | Value |
|---|---|
| Dataset size (full config) | 3,000 examples |
| Validation set (full config) | 500 examples |
| Epochs (full config) | 20 |
| Batch size (full config) | 32 |
| Learning rate | 5e-5 |
| Lambda (consistency weight) | 1.0 |
| Model (full config) | GPT-2-style Transformer, `small` config (~10M params); the `gpt2_small` config (~117M params) is also provided |
| Optimizer | AdamW (weight decay=0.01, grad clip=1.0) |
| Checkpoint interval | every 5 epochs |
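
For reference, the table above can be mirrored as a small configuration object. This is only a sketch: the class name `FullConfig`, its field names, and the two helper methods are hypothetical and are not taken from `run_experiment.py`. The step count assumes the last partial batch of each epoch is kept.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FullConfig:
    """Hypothetical container for the full-run parameters listed in the table."""
    dataset_size: int = 3000
    val_size: int = 500
    epochs: int = 20
    batch_size: int = 32
    learning_rate: float = 5e-5
    consistency_lambda: float = 1.0
    model: str = "small"
    weight_decay: float = 0.01
    grad_clip: float = 1.0
    checkpoint_every: int = 5

    def checkpoint_epochs(self) -> List[int]:
        """Epochs at which a checkpoint would be written (1-indexed)."""
        return [e for e in range(1, self.epochs + 1) if e % self.checkpoint_every == 0]

    def steps_per_epoch(self) -> int:
        """Optimizer steps per epoch, assuming the final partial batch is kept."""
        return math.ceil(self.dataset_size / self.batch_size)


cfg = FullConfig()
```

With these values, checkpoints fall on epochs 5, 10, 15, and 20.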

## 2. Model Architecture

The model is a **GPT-2-style causal Transformer** implemented from scratch
in PyTorch with identical causal masking semantics to GPT-2.

**Key design choices:**

- **Causal masking**: Standard lower-triangular attention mask.
  Explanation tokens appear *before* claim tokens in the sequence,
  so causal attention already prevents explanation tokens from attending
  to future claim tokens. An explicit additive attention bias further
  enforces this structural constraint.

- **Sequence format**: `<bos> [code] <sep> [explanation] <claim>time_complexity=X</claim>`
  `<claim>space_complexity=Y</claim> <claim>correctness=Z</claim> <eos>`

- **LM head**: Tied to token embeddings. Loss computed over full sequence
  (next-token prediction).

- **Consistency head**: Three linear classifiers (time complexity, space
  complexity, correctness) applied to *mean-pooled hidden states* of
  explanation tokens from the final Transformer layer.

- **Note on GPT-2 weights**: This implementation is a from-scratch
  GPT-2-compatible architecture. Loading pretrained GPT-2 weights would
  require the `transformers` library. The `gpt2_small` config (768 dim,
  12 heads, 12 layers, ~117M params) is provided but requires GPU with
  ≥8GB VRAM.
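
As a rough illustration of the sequence layout, causal mask, and explanation pooling described above, the sketch below uses plain Python lists rather than the report's PyTorch code. The names `label_segments`, `causal_mask`, and `mean_pool` are hypothetical, and the sketch assumes each `<claim>...</claim>` span is a single whitespace-delimited token.

```python
from typing import List


def label_segments(tokens: List[str]) -> List[str]:
    """Tag each token as bos/code/sep/expl/claim/eos, following the sequence format."""
    labels: List[str] = []
    seen_sep = False
    for tok in tokens:
        if tok == "<bos>":
            labels.append("bos")
        elif tok == "<eos>":
            labels.append("eos")
        elif tok == "<sep>":
            labels.append("sep")
            seen_sep = True
        elif tok.startswith("<claim>"):
            labels.append("claim")
        else:
            # Before <sep> a plain token is code; after it, explanation.
            labels.append("expl" if seen_sep else "code")
    return labels


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def mean_pool(hidden: List[List[float]], labels: List[str],
              segment: str = "expl") -> List[float]:
    """Average the hidden vectors at all positions tagged with `segment`."""
    rows = [h for h, lab in zip(hidden, labels) if lab == segment]
    if not rows:
        raise ValueError(f"no positions labelled {segment!r}")
    dim = len(rows[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]


tokens = "<bos> return x <sep> constant time <claim>time_complexity=O(1)</claim> <eos>".split()
labels = label_segments(tokens)
mask = causal_mask(len(tokens))
# Stand-in for final-layer hidden states: one 2-dimensional vector per position.
hidden = [[float(i), 2.0 * i] for i in range(len(tokens))]
pooled = mean_pool(hidden, labels)
```

In the model, a vector like `pooled` would be the input to the three linear classifiers of the consistency head.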

## 3. Experimental Variants

### V1 — Original Ablation Ladder

| Variant | Description | Ablation axis |
|---|---|---|
| `consistency_loss` | Full mechanism: LM loss + consistency loss on explanation token pooling | Baseline reference |
| `no_consistency_loss` | LM loss only; no gradient through consistency head | Isolates LM-only training |
| `claim_only_pooling` | Negative control: pool *claim* tokens instead of explanation tokens | Tests pooling location |
| `random_label_consistency` | Negative control: consistency loss with shuffled ground-truth labels | Tests label signal |
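
The V1 variants differ mainly in how the training objective is assembled. The sketch below shows one plausible reading, assuming the total loss is the LM loss plus lambda times the consistency loss, and that the consistency loss is the mean cross-entropy over the three heads. The report does not state the reduction, so both are assumptions, and all function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def consistency_loss(head_logits: List[Sequence[float]], targets: List[int]) -> float:
    """Mean cross-entropy over the classifier heads (assumed reduction)."""
    losses = [cross_entropy(lg, t) for lg, t in zip(head_logits, targets)]
    return sum(losses) / len(losses)


def total_loss(lm_loss: float, cons_loss: float, lam: float = 1.0,
               use_consistency: bool = True) -> float:
    """`consistency_loss` variant when use_consistency is True, LM-only otherwise."""
    return lm_loss + (lam * cons_loss if use_consistency else 0.0)


def shuffle_labels(labels: List[int], seed: int = 0) -> List[int]:
    """Negative control: permute labels across examples to break the label signal."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Under this reading, `random_label_consistency` would call `consistency_loss` with targets drawn from `shuffle_labels`, while `no_consistency_loss` would pass `use_consistency=False`.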

### V2 — Stronger Ablation Ladder (strict flow + surface bottleneck)

| Variant | Description | Ablation axis |
|---|---|---|
| `no_claim_to_claim_attention` | Like `consistency_loss` but claim tokens **cannot attend to other claim tokens**; claim queries see code + explanation + self only | Tests cross-claim information flow |
| `claims_from_explanation_only` | **Strict flow bottleneck**: claim tokens can attend only to explanation tokens (not code, not BOS/SEP, not other claims). | Tests whether code-to-claim path can be forced through explanation |
| `surface_bottleneck_consistency` | Consistency signal derived from **softmax distributions** (LM logit probs) at explanation positions, not hidden states. Gradients flow through LM outputs. | Tests whether surface-form explanation must encode claim info |
| `surface_bottleneck_no_expl_lm` | Surface bottleneck + **LM loss disabled on mismatched explanation tokens**. Only code and claim positions contribute to LM loss. | Most extreme: removes incentive to fit mismatched explanation text |

**Key predictions for V2:**

- `no_claim_to_claim_attention`: coupling is expected to be similar to V1 `consistency_loss`; the variant isolates information flow between claim spans.
- `claims_from_explanation_only`: forces code-to-explanation-to-claim information path; explanation hidden states should develop stronger semantic structure.
- `surface_bottleneck_consistency`: tests whether consistency pressure propagates to explanation *token choices* (surface form).
- `surface_bottleneck_no_expl_lm`: most extreme — sole pressure on explanation logits is the surface bottleneck consistency signal.
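
To make the surface-bottleneck idea concrete, the sketch below derives a pooled feature from LM output distributions at explanation positions and builds the LM-loss mask for `surface_bottleneck_no_expl_lm`. Mean-pooling the distributions is an assumption (the report says only that the signal is derived from softmax distributions), and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pooled_surface_distribution(lm_logits: List[Sequence[float]],
                                labels: List[str]) -> List[float]:
    """Average the LM output distributions over explanation positions (assumed pooling)."""
    dists = [softmax(row) for row, lab in zip(lm_logits, labels) if lab == "expl"]
    if not dists:
        raise ValueError("no explanation positions to pool")
    vocab = len(dists[0])
    return [sum(d[v] for d in dists) / len(dists) for v in range(vocab)]


def lm_loss_mask(labels: List[str], variant: str) -> List[float]:
    """1.0 where a position contributes to the LM loss, 0.0 where it is masked out."""
    if variant == "surface_bottleneck_no_expl_lm":
        # Only code and claim positions contribute in this variant.
        return [1.0 if lab in ("code", "claim") else 0.0 for lab in labels]
    return [1.0] * len(labels)


labels = ["bos", "code", "sep", "expl", "expl", "claim", "eos"]
lm_logits = [[0.0, 0.0, 0.0] for _ in labels]  # toy logits over a 3-word vocabulary
pooled = pooled_surface_distribution(lm_logits, labels)
```

Because the pooled vector is a function of the LM logits, a classifier loss on it would send gradients back through the explanation token probabilities rather than only through hidden states.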

## 4. Final-Epoch Validation Metrics

*Metrics at epoch 20 (final epoch).*

| Variant | Coupling Strength | BLEU-1 | ROUGE-L | Swap Influence | Claim Accuracy | Val LM Loss |
|---|---|---|---|---|---|---|
| `no_claim_to_claim_attention` | 1.0000 | 0.0758 | 0.0998 | 1.0000 | 1.0000 | 0.0781 |
| `claims_from_explanation_only` | 1.0000 | 0.0708 | 0.0810 | 1.0000 | 0.8000 | 0.0765 |
| `surface_bottleneck_consistency` | 0.6967 | 0.0677 | 0.0889 | -0.1500 | 1.0000 | 0.0729 |
| `surface_bottleneck_no_expl_lm` | 0.8080 | 0.0033 | 0.0014 | -0.0500 | 0.1000 | 0.0570 |

## 5. Statistical Test: consistency_loss vs no_consistency_loss

Welch's t-test could not be computed: the run reported insufficient epoch data
and recorded no more specific reason.

**Interpretation**: With more epochs (full 20-epoch run), the test would
compare BLEU-1 scores in the second half of training across variants.
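
The comparison described above could be sketched as follows. This is a stdlib-only illustration with hypothetical function names, not the report's code. It returns the Welch t statistic and the Welch-Satterthwaite degrees of freedom, and returns `None` when either sample has fewer than two values, which is one way an "insufficient epoch data" outcome could arise.

```python
import math
import statistics
from typing import List, Optional, Tuple


def second_half(values: List[float]) -> List[float]:
    """Keep the per-epoch values from the second half of training."""
    return values[len(values) // 2:]


def welch_t(a: List[float], b: List[float]) -> Optional[Tuple[float, float]]:
    """Welch's unequal-variance t statistic and degrees of freedom, or None."""
    if len(a) < 2 or len(b) < 2:
        return None  # a sample variance needs at least two observations
    se_a = statistics.variance(a) / len(a)
    se_b = statistics.variance(b) / len(b)
    if se_a + se_b == 0.0:
        return None  # both samples are constant; the statistic is undefined
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1))
    return t, df
```

Here `welch_t(second_half(bleu_a), second_half(bleu_b))` would compare two variants, where `bleu_a` and `bleu_b` are hypothetical per-epoch BLEU-1 lists. A p-value additionally needs the t-distribution CDF; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same test directly.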

## 6. Metric Trajectory Summary

### `no_claim_to_claim_attention`

| Metric | Epoch 1 | Epoch 20 | Δ |
|---|---|---|---|
| Coupling Strength | 1.0000 | 1.0000 | +0.0000 |
| BLEU-1 | 0.0183 | 0.0758 | +0.0574 |
| ROUGE-L | 0.0154 | 0.0998 | +0.0844 |
| Swap Influence | 1.0000 | 1.0000 | +0.0000 |
| Claim Accuracy | 0.2833 | 1.0000 | +0.7167 |

### `claims_from_explanation_only`

| Metric | Epoch 1 | Epoch 20 | Δ |
|---|---|---|---|
| Coupling Strength | 1.0000 | 1.0000 | +0.0000 |
| BLEU-1 | 0.0275 | 0.0708 | +0.0433 |
| ROUGE-L | 0.0233 | 0.0810 | +0.0577 |
| Swap Influence | 1.0000 | 1.0000 | +0.0000 |
| Claim Accuracy | 0.2667 | 0.8000 | +0.5333 |

### `surface_bottleneck_consistency`

| Metric | Epoch 1 | Epoch 20 | Δ |
|---|---|---|---|
| Coupling Strength | 0.6967 | 0.6967 | +0.0000 |
| BLEU-1 | 0.0220 | 0.0677 | +0.0457 |
| ROUGE-L | 0.0252 | 0.0889 | +0.0638 |
| Swap Influence | -0.1500 | -0.1500 | +0.0000 |
| Claim Accuracy | 0.6500 | 1.0000 | +0.3500 |

### `surface_bottleneck_no_expl_lm`

| Metric | Epoch 1 | Epoch 20 | Δ |
|---|---|---|---|
| Coupling Strength | 0.6967 | 0.8080 | +0.1113 |
| BLEU-1 | 0.0000 | 0.0033 | +0.0033 |
| ROUGE-L | 0.0000 | 0.0014 | +0.0014 |
| Swap Influence | -0.1500 | -0.0500 | +0.1000 |
| Claim Accuracy | 0.6500 | 0.1000 | -0.5500 |

## 7. Qualitative Examples: Epoch-1 vs Final Epoch Generations

The following examples are drawn from the `consistency_loss` variant.
They show the model's generated explanation at epoch 1 (essentially random,
as the model just started training) versus the final epoch.

> **Smoke-run note**: With a tiny model and few epochs, generations are
> short and may not yet form coherent prose. The progression from epoch 1
> to the final epoch demonstrates that training is occurring and the
> model is adapting, even if fluency is limited.

> No qualitative examples were collected in this run
> (fewer than 10 examples were available in this run configuration).
> The full run with 3,000 examples would provide richer qualitative analysis.

## 8. Limitations and Interpretation

1. **Smoke run constraints**: The smoke configuration uses a tiny Transformer
   (~0.5–2M params), a reduced dataset, and few epochs. These constraints
   prevent the model from reaching the performance levels expected in the full run.

2. **Tokenizer**: A simple whitespace tokenizer is used (no BPE/SentencePiece).
   Each whitespace-delimited word is a single token, so words not seen in
   training cannot be split into known subword pieces, and the vocabulary may
   not generalize as well as a subword vocabulary.

3. **BLEU/ROUGE proxy**: BLEU-1 and ROUGE-L are computed against ground-truth
   explanation templates, not against diverse human references. They measure
   whether the model recovers the training-set language, not open-ended quality.

4. **Claim emission accuracy**: Measured by string matching on greedy-decoded
   output. A model could emit the correct claim token by memorizing training
   examples rather than by generalizing.

5. **Coupling vs causality**: Classifier accuracy on explanation hidden states
   measures *correlation*, not causal coupling. The full experiment with
   multiple random seeds and probing experiments would provide stronger evidence.

6. **Full 20-epoch run**: The full configuration (3,000 examples, 20 epochs,
   small model) requires approximately 2–4 hours on a modern GPU. The
   `gpt2_small` config (~117M params) would require ≥8GB GPU VRAM and
   significantly more compute.
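
For readers unfamiliar with the overlap metrics in item 3, the sketch below gives common textbook definitions: BLEU-1 as clipped unigram precision times a brevity penalty, and ROUGE-L as the F1 score of the longest common subsequence. The report does not specify which exact variants its evaluation code uses, so these are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def bleu1(candidate: List[str], reference: List[str]) -> float:
    """Clipped unigram precision multiplied by the BLEU brevity penalty."""
    if not candidate or not reference:
        return 0.0
    cand_counts, ref_counts = Counter(candidate), Counter(reference)
    clipped = sum(min(c, ref_counts[tok]) for tok, c in cand_counts.items())
    precision = clipped / len(candidate)
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1.0 - len(reference) / len(candidate))
    return brevity * precision


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate: List[str], reference: List[str]) -> float:
    """F1 of LCS-based precision and recall."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(candidate), lcs / len(reference)
    return 2 * precision * recall / (precision + recall)
```

Both functions score a candidate against a single reference, which matches the template-based setup described in item 3: a high score means the model reproduced the template wording, not that the explanation is good in an open-ended sense.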

## 9. Run Instructions

See `README.md` for full setup and run instructions.

```bash
# Smoke run (fast, ~2-5 min on CPU):
python run_experiment.py --smoke

# Full run (20 epochs, 3000 examples):
python run_experiment.py --full

# Small model, custom config:
python run_experiment.py --full --model small --epochs 20 --batch 32

# GPT-2-style config (GPU required):
python run_experiment.py --full --model gpt2_small
```
