Keywords: AI Safety, Applications of interpretability
TL;DR: We identify entangled tokens - pairs of seemingly unrelated tokens where raising the probability of one also raises the probability of the other - and use them to reliably steer model behavior.
Abstract: Subliminal learning is the phenomenon wherein the hidden preferences of a teacher language model are transferred to a student by training on seemingly unrelated data (e.g., lists of random numbers), raising serious concerns for model safety and alignment.
We propose that token entanglement plays a role in this phenomenon.
Token entanglement occurs when the representation of one token directly influences, or is influenced by, that of another, so that increasing the probability the model assigns to one token (e.g., "owl") also increases the probability of its entangled token (e.g., "087").
We show that entangled tokens exist in modern LLMs and develop three methods to identify them: inspecting similarities in the unembedding matrix (sketched below), analyzing the model's output distribution, and computing token-frequency ratios in the fine-tuning data used to demonstrate subliminal learning.
We further introduce subliminal prompting, in which inserting a token directly into a prompt triggers the model to express a preference for its entangled token without any fine-tuning (also sketched below).
Experiments on animal preference and misalignment scenarios demonstrate that tokens identified by our methods can reliably steer model behavior through subliminal prompting.
Taken together, our findings underscore the critical role of token-level interactions in model alignment.
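To make the unembedding-similarity method concrete, the sketch below ranks vocabulary tokens by the cosine similarity of their unembedding rows to a query token's row; high-similarity but semantically unrelated tokens are candidate entangled tokens. This is a minimal sketch, not the paper's implementation: the model name ("gpt2"), the query token (" owl"), and the helper entangled_candidates() are illustrative assumptions.

```python
# Minimal sketch of the unembedding-similarity method, assuming a
# Hugging Face causal LM. Model name, query token, and helper name
# are illustrative, not the paper's setup.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Unembedding (output projection) matrix: one row per vocabulary token.
W_U = model.get_output_embeddings().weight.detach()  # (vocab_size, hidden_dim)
W_hat = F.normalize(W_U, dim=-1)  # unit-norm rows, so dot products are cosines

def entangled_candidates(text: str, top_k: int = 10):
    """Rank vocabulary tokens by cosine similarity of their unembedding
    rows to the query token's row."""
    token_id = tokenizer.encode(text)[0]  # first sub-token of the query
    sims = W_hat @ W_hat[token_id]        # (vocab_size,)
    sims[token_id] = float("-inf")        # exclude the query token itself
    scores, ids = sims.topk(top_k)
    return [(tokenizer.decode([i]), round(s.item(), 3))
            for i, s in zip(ids.tolist(), scores)]

print(entangled_candidates(" owl"))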
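Subliminal prompting can be probed in the same setup: insert a candidate token into the prompt and measure how the probability of its putatively entangled token shifts. Again a hedged sketch reusing the model and tokenizer above; the prompts, the candidate pair ("087" / " owl"), and the helper preference_shift() are illustrative assumptions.

```python
# Hedged sketch of a subliminal-prompting probe; prompts and token pair
# are illustrative, not taken from the paper's experiments.
import torch

@torch.no_grad()
def next_token_probs(prompt: str) -> torch.Tensor:
    """Next-token distribution after the prompt."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    logits = model(ids).logits[0, -1]
    return torch.softmax(logits, dim=-1)

def preference_shift(base_prompt: str, inserted: str, target: str) -> float:
    """By what factor does prepending `inserted` to the prompt change the
    probability of the (putatively entangled) `target` token?"""
    target_id = tokenizer.encode(target)[0]  # first sub-token of the target
    p_plain = next_token_probs(base_prompt)[target_id]
    p_primed = next_token_probs(inserted + " " + base_prompt)[target_id]
    return (p_primed / p_plain).item()

print(preference_shift("My favorite animal is the", "087", " owl"))
```

A ratio well above 1 would indicate that the inserted token steers the model toward its entangled partner without any fine-tuning.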
Submission Number: 184