## Table 1

| Metric Name | Description |
| :--- | :--- |
| `cls_claim_acc` | Classification accuracy of the consistency head. |
| `gen_claim_acc` | Accuracy of the label token generated by the language model. |
| `cfact_cls_follows_swap` | How often the classifier's prediction follows the label of the swapped-in evidence. |
| `cfact_cls_follows_orig` | How often the classifier keeps the original label despite the swapped evidence. |
| `matched_cfact_cls_follows_swap` | Same as `cfact_cls_follows_swap`, but with swaps restricted to lexically similar claims that have different labels. |
| `matched_cfact_cls_follows_orig` | How often the classifier keeps the original label in matched-claim swaps. |
| `Δcls_claim_acc` | Difference in `cls_claim_acc` relative to the no-consistency baseline. |
| `Δcfact_cls_follows_swap` | Change in `cfact_cls_follows_swap` relative to the no-consistency baseline. |
| `Δcfact_cls_follows_orig` | Change in `cfact_cls_follows_orig` relative to the no-consistency baseline. |
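
The counterfactual and delta metrics above could be computed roughly as follows. This is a minimal sketch, not the evaluation code behind the table: the function names, the integer label ids, and the choice to report rates as fractions of all swapped examples are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def counterfactual_follow_rates(
    preds: Sequence[int],
    orig_labels: Sequence[int],
    swap_labels: Sequence[int],
) -> Tuple[float, float]:
    """Return (follows_swap, follows_orig) as fractions of all examples.

    Assumption: each example pairs a claim with swapped-in evidence, where
    `orig_labels` holds the claim's original label and `swap_labels` holds
    the label implied by the swapped evidence.
    """
    n = len(preds)
    if n == 0:
        raise ValueError("need at least one example")
    if len(orig_labels) != n or len(swap_labels) != n:
        raise ValueError("all inputs must have the same length")
    follows_swap = sum(p == s for p, s in zip(preds, swap_labels)) / n
    follows_orig = sum(p == o for p, o in zip(preds, orig_labels)) / n
    return follows_swap, follows_orig


def delta_vs_baseline(value: float, baseline: float) -> float:
    """Delta metrics: the run's value minus the no-consistency baseline's."""
    return value - baseline


# Hypothetical label ids, e.g. 1 and 0 for two opposing classes.
preds = [1, 0, 0, 1]
orig_labels = [1, 1, 0, 0]
swap_labels = [0, 0, 1, 1]
print(counterfactual_follow_rates(preds, orig_labels, swap_labels))
```

When the original and swapped labels always differ, as in this toy input, the two rates cannot both count the same example, so they sum to at most 1.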

## Table 2

| Parameter | Value |
| :--- | :--- |
| Model Name | gpt2 |
| Training Samples | 50,000 |
| Evaluation Samples | 5,000 |
| Max Sequence Length | 256 |
| Epochs | 5 |
| Batch Size | 16 |
| Learning Rate | 5e-5 |
| Consistency Loss Weight | 0.5 |
| Freeze Lower Layers Epochs | 1 |
| Seed | 42 |
| Output Stem | fever50k_tightened_diag |
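
For reproducibility, the settings above can be captured in a single configuration object. The sketch below is one possible encoding: the class name `RunConfig` and its field names are hypothetical, and the step counts assume a single device, no gradient accumulation, and no dropped final batch, none of which the table states.

```python
import math
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container mirroring the values in Table 2."""

    model_name: str = "gpt2"
    train_samples: int = 50_000
    eval_samples: int = 5_000
    max_seq_len: int = 256
    epochs: int = 5
    batch_size: int = 16
    learning_rate: float = 5e-5
    consistency_loss_weight: float = 0.5
    freeze_lower_layers_epochs: int = 1
    seed: int = 42
    output_stem: str = "fever50k_tightened_diag"

    def steps_per_epoch(self) -> int:
        # Assumes one optimizer step per batch and a kept partial last batch.
        return math.ceil(self.train_samples / self.batch_size)

    def total_steps(self) -> int:
        return self.steps_per_epoch() * self.epochs


cfg = RunConfig()
print(asdict(cfg))
print(cfg.steps_per_epoch(), cfg.total_steps())
```

Under those assumptions, the table's values give 3125 optimizer steps per epoch and 15625 steps over the full run.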
