# Claim Consistency Coupling — Hard Overlapping-Vocab Experiment

## Overview

This experiment re-runs the four consistency-loss training variants on a
**harder synthetic dataset** where rationale token vocabularies overlap by
approximately 50% across latent states. Instead of each state having a fully
private token range, templates are constructed from:

- **Shared tokens** (appear in all 8 states)
- **Group tokens** (appear in 2 adjacent states)
- **Local tokens** (unique to one state; roughly 50% of positions)

The model cannot classify states by single unique token identities; it must
learn co-occurrence / combination patterns across the 8-token rationale span.
Claim tokens remain fully state-specific (non-overlapping).
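
The three token tiers can be sketched as follows. This is a minimal illustration, not the generator used in the experiment: the token-ID ranges, the pairing of adjacent states as (0,1), (2,3), ..., the helper names, and the way positions are split between overlapping and local tokens are all assumptions made for the example. It only shows the tier layout and the 50% mixing ratio.

```python
import random

NUM_STATES = 8          # num_latent_states
SPAN_LEN = 8            # length of the rationale span
OVERLAP_FRACTION = 0.5  # overlap_fraction

# Hypothetical token-ID layout; the real vocabulary layout is not given here.
SHARED = list(range(0, 8))  # tokens that appear in all states


def group_tokens(state):
    """Tokens shared by one pair of adjacent states: (0,1), (2,3), ... (assumed pairing)."""
    base = 100 + (state // 2) * 8
    return list(range(base, base + 8))


def local_tokens(state):
    """Tokens private to a single state."""
    base = 200 + state * 8
    return list(range(base, base + 8))


def sample_rationale(state, rng):
    """Draw one rationale span mixing overlapping and local tokens."""
    n_overlap = round(SPAN_LEN * OVERLAP_FRACTION)
    overlap_pool = SHARED + group_tokens(state)
    tokens = [rng.choice(overlap_pool) for _ in range(n_overlap)]
    tokens += [rng.choice(local_tokens(state)) for _ in range(SPAN_LEN - n_overlap)]
    rng.shuffle(tokens)  # positions of local tokens vary between samples
    return tokens


rng = random.Random(42)
rationale = sample_rationale(3, rng)
print(rationale)
```

With this layout, states 0 and 1 draw from the same group range while states 1 and 2 do not, so a group token narrows the state down to a pair rather than identifying it.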

## Hyperparameters

| Parameter | Value |
|---|---|
| hard_overlap_vocab | True |
| overlap_fraction | 0.5 |
| num_train_samples | 512 |
| num_eval_samples | 128 |
| num_shuffled_samples | 128 |
| num_epochs | 10 |
| num_latent_states | 8 |
| num_rationale_templates | 4 |
| d_model | 64 |
| n_layers | 2 |
| n_heads | 4 |
| d_ff | 128 |
| batch_size | 32 |
| lr | 0.0003 |
| consistency_loss_weight | 0.5 |
| seed | 42 |
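
For reference, the table above can be captured as a small config object with a few derived quantities. The class name `ExperimentConfig` and the derived properties are hypothetical conveniences, not part of the experiment code; the field names and values are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container mirroring the hyperparameter table."""
    hard_overlap_vocab: bool = True
    overlap_fraction: float = 0.5
    num_train_samples: int = 512
    num_eval_samples: int = 128
    num_shuffled_samples: int = 128
    num_epochs: int = 10
    num_latent_states: int = 8
    num_rationale_templates: int = 4
    d_model: int = 64
    n_layers: int = 2
    n_heads: int = 4
    d_ff: int = 128
    batch_size: int = 32
    lr: float = 0.0003
    consistency_loss_weight: float = 0.5
    seed: int = 42

    @property
    def steps_per_epoch(self) -> int:
        # 512 samples divide evenly into batches of 32
        return self.num_train_samples // self.batch_size

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.num_epochs

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads

    @property
    def chance_accuracy(self) -> float:
        # Assumes labels are uniformly distributed over the latent states
        return 1.0 / self.num_latent_states


cfg = ExperimentConfig()
print(cfg.steps_per_epoch, cfg.total_steps, cfg.head_dim, cfg.chance_accuracy)
```

Under these values training runs for 160 optimizer steps in total, and uniform guessing over 8 states gives 0.125 accuracy, which is a useful baseline when reading the shuffled-pairing columns in the results.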

## Results

| variant             |   final_lm_loss |   final_cons_loss |   gen_claim_acc |   cls_claim_acc (rationale_pool) |   cfact_gen_follows_swap |   cfact_gen_follows_orig |   cfact_cls_follows_swap |   cfact_cls_follows_orig |   shuffled_gen_acc |   shuffled_cls_acc |
|:--------------------|----------------:|------------------:|----------------:|---------------------------------:|-------------------------:|-------------------------:|-------------------------:|-------------------------:|-------------------:|-------------------:|
| no_consistency_loss |          2.3577 |            0      |          1      |                           0.0469 |                   1      |                        0 |                   0.0625 |                   0.1406 |             0.1016 |             0.1719 |
| rationale_only      |          2.4323 |            0.3435 |          1      |                           1      |                   1      |                        0 |                   1      |                   0      |             0.1016 |             0.1016 |
| full_sequence       |          2.4837 |            0.8275 |          0.8125 |                           1      |                   0.7656 |                        0 |                   1      |                   0      |             0.0938 |             0.1016 |
| earlier_token_only  |          2.4884 |            0.8633 |          1      |                           1      |                   1      |                        0 |                   1      |                   0      |             0.1016 |             0.1016 |

## Strong Coupling Threshold Check

Thresholds: `cls_claim_acc (rationale_pool) > 0.9` AND `cfact_cls_follows_swap > 0.9`

| Variant | cls_claim_acc (rationale_pool) | cfact_cls_follows_swap | Meets Thresholds? |
|---------|:---:|:---:|:---:|
| no_consistency_loss | 0.0469 | 0.0625 | NO |
| rationale_only | 1.0000 | 1.0000 | **YES** |
| full_sequence | 1.0000 | 1.0000 | **YES** |
| earlier_token_only | 1.0000 | 1.0000 | **YES** |
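
The check itself is a two-condition comparison. The sketch below reapplies it to the values reported in the results table; the function and variable names are illustrative and not taken from the experiment code.

```python
THRESHOLD = 0.9

# (cls_claim_acc (rationale_pool), cfact_cls_follows_swap) from the results table
results = {
    "no_consistency_loss": (0.0469, 0.0625),
    "rationale_only": (1.0, 1.0),
    "full_sequence": (1.0, 1.0),
    "earlier_token_only": (1.0, 1.0),
}


def meets_thresholds(cls_acc, swap_rate, threshold=THRESHOLD):
    """Both metrics must be strictly above the threshold."""
    return cls_acc > threshold and swap_rate > threshold


verdicts = {name: meets_thresholds(*vals) for name, vals in results.items()}
for name, ok in verdicts.items():
    print(f"{name}: {'YES' if ok else 'NO'}")
```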

## Column Descriptions

| Column | Description |
|--------|-------------|
| `variant` | Training objective variant (pooling mode for consistency loss) |
| `final_lm_loss` | Cross-entropy LM loss at end of training |
| `final_cons_loss` | Consistency classification loss at end of training |
| `gen_claim_acc` | Greedy generation accuracy: first generated token matches expected claim token |
| `cls_claim_acc (rationale_pool)` | Classifier accuracy from mean-pooled rationale hidden states |
| `cfact_gen_follows_swap` | Fraction of counterfactual test cases in which generation follows the swapped (wrong) rationale |
| `cfact_gen_follows_orig` | Fraction of counterfactual test cases in which generation follows the original claim despite the swapped rationale |
| `cfact_cls_follows_swap` | Fraction of counterfactual test cases in which the classifier follows the swapped rationale (high = strong coupling) |
| `cfact_cls_follows_orig` | Fraction of counterfactual test cases in which the classifier follows the original claim despite the swap (high = weak coupling) |
| `shuffled_gen_acc` | Generation accuracy under shuffled rationale-claim pairings |
| `shuffled_cls_acc` | Classifier accuracy under shuffled rationale-claim pairings |
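
The two counterfactual rates for a given head can be computed independently, as in the sketch below. It assumes each counterfactual example carries a predicted state, the state of the swapped-in rationale, and the original state, and that the swapped and original states differ; the function name and the toy data are hypothetical.

```python
def follow_rates(predictions, swap_labels, orig_labels):
    """Return (follows_swap, follows_orig) as fractions of all examples."""
    n = len(predictions)
    follows_swap = sum(p == s for p, s in zip(predictions, swap_labels)) / n
    follows_orig = sum(p == o for p, o in zip(predictions, orig_labels)) / n
    return follows_swap, follows_orig


# Toy example: the last prediction matches neither the swapped nor the original state
preds = [2, 5, 7, 1]
swap = [2, 5, 0, 3]
orig = [4, 6, 7, 6]

swap_rate, orig_rate = follow_rates(preds, swap, orig)
print(swap_rate, orig_rate)
```

Because a prediction can match neither state, the two rates need not sum to 1. In the results table, for example, the `no_consistency_loss` classifier has 0.0625 + 0.1406 = 0.2031, so roughly 80% of its counterfactual predictions match neither the swapped nor the original state.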
