# Claim Consistency – Hidden-State Intervention / Causal Patching Results

## Methodology

For each sample pair (original rationale from latent state A, swapped rationale from latent state B), the evaluation:
1. Runs a forward pass with the **original** sequence and caches the post-block hidden states at rationale token positions after each transformer block.
2. For each block `i`, runs a second forward pass with the **swapped** sequence but **replaces** the hidden states at rationale positions after block `i` with the cached original states, so that all subsequent blocks process the patched activations.
3. Reads the logits at the final SEP position (immediately before the claim span) to identify the predicted first claim token (greedy, single-token).
4. Records whether the prediction matches the **original** latent state's claim token (`intervention_follows_original_hs`) or the **swapped** state's claim token (`intervention_follows_swapped_tokens`).
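
The caching and patching steps above can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the evaluation's actual code: hidden states are scalars rather than `d_model`-sized vectors, and `toy_block`, `forward`, and the `patch` tuple are hypothetical names introduced only for this example.

```python
from typing import List, Optional, Sequence, Tuple

Hidden = List[float]  # one scalar per position (toy stand-in for real hidden vectors)


def toy_block(h: Hidden, weight: float) -> Hidden:
    """Causal mixing: each position adds a weighted mean of itself and earlier positions."""
    out = []
    for t in range(len(h)):
        prefix_mean = sum(h[: t + 1]) / (t + 1)
        out.append(h[t] + weight * prefix_mean)
    return out


def forward(
    embeddings: Hidden,
    block_weights: Sequence[float],
    patch: Optional[Tuple[int, Sequence[int], Hidden]] = None,
) -> Tuple[Hidden, List[Hidden]]:
    """Run every block; optionally overwrite states after one block.

    patch = (block_index, positions, donor_states_after_that_block).
    Returns the final hidden states and the list of post-block states.
    """
    h = list(embeddings)
    cache: List[Hidden] = []
    for i, w in enumerate(block_weights):
        h = toy_block(h, w)
        if patch is not None and patch[0] == i:
            _, positions, donor = patch
            for p in positions:
                h[p] = donor[p]  # replace with the cached original state
        cache.append(list(h))
    return h, cache


# Hypothetical toy inputs: positions 0-2 are the rationale, position 3 is SEP.
block_weights = [0.5, 0.25]
original = [1.0, 2.0, 3.0, 0.0]
swapped = [4.0, 5.0, 6.0, 0.0]
rationale_positions = [0, 1, 2]
sep = 3

# Step 1: cache post-block states from the original sequence.
orig_final, orig_cache = forward(original, block_weights)
swap_final, _ = forward(swapped, block_weights)

# Step 2: rerun the swapped sequence, patching after block 0 or after block 1.
patched0, _ = forward(swapped, block_weights, patch=(0, rationale_positions, orig_cache[0]))
patched1, _ = forward(swapped, block_weights, patch=(1, rationale_positions, orig_cache[1]))

print("original SEP state:", orig_final[sep])
print("swapped SEP state: ", swap_final[sep])
print("patched after block 0:", patched0[sep])
print("patched after block 1:", patched1[sep])
```

In this toy setup, patching after block 0 moves the SEP read-out toward the original run, while patching after the last block leaves it identical to the unpatched swapped run: the SEP position itself is not patched, and no later block remains to carry the patched rationale states into it. If the SEP position is likewise excluded from the patched rationale positions in the real evaluation, the same effect would be expected at the final block.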

**Claim prediction position**: the logits at index `prefix_len − 1`, which is position 14 (0-based, so `prefix_len` = 15). This matches the greedy next-token decoding used in `generate_claim()`.
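
As a small illustration of this read-out, the sketch below takes the argmax over the vocabulary at index `prefix_len - 1`. The function name `predict_first_claim_token`, the vocabulary size of 4, and the logit values are assumptions made for the example; only the index arithmetic comes from the description above.

```python
from typing import Sequence


def predict_first_claim_token(logits: Sequence[Sequence[float]], prefix_len: int) -> int:
    """Greedy single-token prediction from the last prefix position.

    logits[t] holds the next-token scores produced at position t, so the
    first claim token is predicted from position prefix_len - 1.
    """
    position = prefix_len - 1
    row = logits[position]
    return max(range(len(row)), key=lambda tok: row[tok])


# Hypothetical toy logits: 15 prefix positions, vocabulary of 4 tokens.
prefix_len = 15
vocab_size = 4
logits = [[0.0] * vocab_size for _ in range(prefix_len)]
logits[prefix_len - 1] = [0.1, 2.0, -1.0, 0.5]  # scores at position 14

predicted = predict_first_claim_token(logits, prefix_len)
print(predicted)
```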

## Hyperparameters

| Parameter | Value |
|---|---|
| num_train_samples | 5120 |
| num_epochs | 30 |
| num_latent_states | 8 |
| n_layers | 2 |
| d_model | 64 |
| n_heads | 4 |
| d_ff | 128 |
| n_intervention_samples | 64 |
| seed | 42 |
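
For reference, the table above can be captured as a small configuration object. This is a sketch only: the class name `RunConfig` and the derived helpers are hypothetical, and the per-head dimension assumes standard multi-head attention that splits `d_model` evenly across heads.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunConfig:
    # Values copied from the hyperparameter table.
    num_train_samples: int = 5120
    num_epochs: int = 30
    num_latent_states: int = 8
    n_layers: int = 2
    d_model: int = 64
    n_heads: int = 4
    d_ff: int = 128
    n_intervention_samples: int = 64
    seed: int = 42

    @property
    def head_dim(self) -> int:
        """Per-head width, assuming d_model is split evenly across heads."""
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        return self.d_model // self.n_heads

    @property
    def patch_layer_ids(self) -> List[str]:
        """One patch site per transformer block, labelled block_<index>."""
        return [f"block_{i}" for i in range(self.n_layers)]


config = RunConfig()
print(config.head_dim)
print(config.patch_layer_ids)
```

With `n_layers = 2`, there are exactly two patch sites, which matches the two rows per variant in the results table.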

## Results

| variant             |   layer | patch_layer_id   |   intervention_follows_original_hs |   intervention_follows_swapped_tokens |   n_samples |
|:--------------------|--------:|:-----------------|-----------------------------------:|--------------------------------------:|------------:|
| no_consistency_loss |       0 | block_0          |                             0.3125 |                                0.6875 |          64 |
| no_consistency_loss |       1 | block_1          |                             0      |                                1      |          64 |
| rationale_only      |       0 | block_0          |                             0.8906 |                                0.1094 |          64 |
| rationale_only      |       1 | block_1          |                             0      |                                1      |          64 |
| full_sequence       |       0 | block_0          |                             0.75   |                                0.25   |          64 |
| full_sequence       |       1 | block_1          |                             0      |                                1      |          64 |
| earlier_token_only  |       0 | block_0          |                             0.7344 |                                0.25   |          64 |
| earlier_token_only  |       1 | block_1          |                             0      |                                1      |          64 |
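
The fractions above can be converted back to sample counts as a sanity check. The sketch below assumes each fraction is a count out of `n_samples = 64` rounded to four decimals; the helper `to_counts` is a hypothetical name used only here.

```python
from typing import List, Tuple

N_SAMPLES = 64

# (variant, patch_layer_id, follows_original_hs, follows_swapped_tokens), copied from the table.
rows: List[Tuple[str, str, float, float]] = [
    ("no_consistency_loss", "block_0", 0.3125, 0.6875),
    ("no_consistency_loss", "block_1", 0.0, 1.0),
    ("rationale_only", "block_0", 0.8906, 0.1094),
    ("rationale_only", "block_1", 0.0, 1.0),
    ("full_sequence", "block_0", 0.75, 0.25),
    ("full_sequence", "block_1", 0.0, 1.0),
    ("earlier_token_only", "block_0", 0.7344, 0.25),
    ("earlier_token_only", "block_1", 0.0, 1.0),
]


def to_counts(follows_original: float, follows_swapped: float, n: int) -> Tuple[int, int, int]:
    """Return (original count, swapped count, remainder matching neither)."""
    orig = round(follows_original * n)
    swap = round(follows_swapped * n)
    return orig, swap, n - orig - swap


counts = {(v, layer): to_counts(fo, fs, N_SAMPLES) for v, layer, fo, fs in rows}
for key, value in counts.items():
    print(key, value)
```

Under that assumption, every row accounts for all 64 samples except `earlier_token_only` at `block_0`, where 47 + 16 = 63. This suggests that one prediction there matched neither claim token.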

## Column Descriptions

- **variant**: Training objective variant (pooling mode for consistency loss)
- **layer**: 0-based transformer block index
- **patch_layer_id**: Human-readable block label (e.g. `block_0`)
- **intervention_follows_original_hs**: Fraction of samples where patching the hidden states after this block makes the model predict the *original* latent state's claim token (higher = the patched hidden states dominate)
- **intervention_follows_swapped_tokens**: Fraction of samples where the patched model still predicts the *swapped* rationale's claim token (higher = the surface tokens still dominate despite the patch)
- **n_samples**: Number of (orig, swap) sample pairs evaluated
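
A minimal sketch of how the two fraction columns could be computed from per-sample predictions follows. The function `summarize` and the token ids are hypothetical, and the sketch assumes the original and swapped claim tokens differ within each pair, so a prediction can match at most one of them.

```python
from typing import Dict, Sequence


def summarize(
    predicted: Sequence[int],
    original_tokens: Sequence[int],
    swapped_tokens: Sequence[int],
) -> Dict[str, float]:
    """Aggregate per-sample matches into the two reported fractions."""
    n = len(predicted)
    follows_original = sum(p == o for p, o in zip(predicted, original_tokens))
    follows_swapped = sum(p == s for p, s in zip(predicted, swapped_tokens))
    return {
        "intervention_follows_original_hs": follows_original / n,
        "intervention_follows_swapped_tokens": follows_swapped / n,
        "n_samples": n,
    }


# Hypothetical toy batch of 4 pairs: two follow the original state,
# one follows the swapped tokens, and one matches neither.
predicted = [3, 7, 5, 9]
original_tokens = [3, 2, 5, 1]
swapped_tokens = [4, 7, 6, 8]

summary = summarize(predicted, original_tokens, swapped_tokens)
print(summary)
```

Because a prediction may match neither token, the two fractions are not forced to sum to 1.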
