Keywords: consistency training, AI alignment, sycophancy, jailbreak robustness, prefill attacks, persona attacks, in-context learning attacks, mechanistic interpretability, adversarial robustness
TL;DR: We extend consistency training by varying where it is applied (MLP and attention layers) and what it targets (persona, frustration, and prefill attacks).
Abstract: Consistency training, which encourages a model to behave the same way on clean and perturbed inputs, has shown promise for reducing certain types of misalignment, but existing methods have explored only a narrow slice of the design space. We extend consistency training along two axes: where in the transformer stack to enforce consistency, and which threats to apply it to. We introduce MLP Consistency Training (MLPCT) and Attention Consistency Training (AttCT), which achieve results competitive with previous baselines on reducing sycophancy and jailbreak susceptibility across several methods. We then apply these and prior methods to three new threat models: persona in-context learning attacks, frustration, and prefill attacks, showing that the consistency training framework can be extended to other safety concerns. Our results also reveal striking cross-threat generalisation in certain cases, and show that activation-level consistency methods converge on a shared mid-layer representational correction that is distinct from output-level behavioral training. Together, these results suggest that consistency is a useful alignment design principle, but only when the agreement target is matched to the structure of the failure mode.
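As a rough illustration of what "enforcing consistency" at a chosen layer could mean, the sketch below computes a mean squared difference between activations recorded on a clean prompt and on its perturbed counterpart. This is an assumption made for illustration, not the paper's stated objective: the function name `consistency_loss`, the squared-error form, and the plain-list representation of activations are all hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def consistency_loss(clean_acts: Sequence[Vector],
                     perturbed_acts: Sequence[Vector]) -> float:
    """Mean squared difference between two activation sequences.

    Hypothetical helper: each argument is a list of per-token vectors taken
    from the same component (e.g. an MLP or attention output) for the clean
    and the perturbed input. The squared-error form is an assumption.
    """
    if len(clean_acts) != len(perturbed_acts):
        raise ValueError("activation sequences must cover the same positions")

    total = 0.0
    count = 0
    for clean_vec, pert_vec in zip(clean_acts, perturbed_acts):
        if len(clean_vec) != len(pert_vec):
            raise ValueError("activation vectors must have the same width")
        for c, p in zip(clean_vec, pert_vec):
            total += (p - c) ** 2
            count += 1

    # An empty input carries no disagreement, so report zero loss.
    return total / count if count else 0.0


# Identical activations give zero loss; differing ones give a positive loss.
same = consistency_loss([[1.0, 2.0]], [[1.0, 2.0]])
diff = consistency_loss([[0.0, 0.0]], [[3.0, 4.0]])
```

In an actual training setup, the clean activations would presumably serve as a fixed target, with gradients flowing only through the perturbed pass; that detail is likewise assumed here rather than taken from the abstract.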
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 459