Token Entanglement in Subliminal Learning

Published: 30 Sept 2025 · Last Modified: 30 Sept 2025 · Mech Interp Workshop (NeurIPS 2025) Poster · CC BY 4.0
Keywords: AI Safety, Applications of interpretability
TL;DR: We propose and validate a mechanism for subliminal learning, the phenomenon wherein hidden preferences of a teacher language model are transferred to a student by training on sequences of seemingly random numbers.
Abstract: Subliminal learning is the phenomenon wherein hidden preferences of a teacher language model are transferred to a student by training on sequences of seemingly random data (e.g., lists of random numbers), raising serious concerns for model safety and alignment. We propose that *token entanglement* plays a role in this phenomenon. Token entanglement occurs when the representation of one token directly influences, or is influenced by, that of another, such that increasing the probability that the model predicts one token (e.g., "owl") also increases the probability that it predicts the entangled token (e.g., "087"). We show that entangled tokens exist in modern LLMs and develop three methods to identify them: inspecting similarities in the unembedding matrix, analyzing the model's output distribution, and computing token frequency ratios in the fine-tuning data. We further introduce *subliminal prompting*, in which inserting a token directly into a prompt triggers a model to express a preference for its entangled token without any fine-tuning. Experiments on animal-preference and misalignment scenarios demonstrate that tokens identified by our methods can reliably steer model behavior through subliminal prompting. We further analyze training data, finding that entangled tokens occur more frequently in the subliminal fine-tuning dataset and co-occur with concept tokens in the pretraining data. Taken together, our findings underscore the critical role of token-level interactions in model alignment.
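The first identification method mentioned in the abstract, inspecting similarities in the unembedding matrix, can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: it assumes entangled candidates for a concept token can be surfaced by ranking all tokens by the cosine similarity of their unembedding rows to the concept token's row. The function name `entangled_candidates` and the toy synthetic matrix are hypothetical.

```python
import numpy as np

def entangled_candidates(W, concept_id, top_k=5):
    """Rank tokens by cosine similarity of their unembedding rows
    to the concept token's row (hypothetical helper; the paper's
    actual method may use a different similarity or normalization)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-normalize rows
    sims = Wn @ Wn[concept_id]                         # cosine similarities
    sims[concept_id] = -np.inf                         # exclude the concept token itself
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

# Toy demo on a synthetic 10-token, 8-dim unembedding matrix:
# token 3 is constructed to be nearly collinear with token 0
# (standing in for a concept token like "owl").
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 8))
W[3] = 0.9 * W[0] + 0.1 * rng.normal(size=8)
ids, sims = entangled_candidates(W, concept_id=0, top_k=3)
print(ids[0])  # the planted near-collinear token should rank first
```

On a real model one would use the actual unembedding (output-projection) matrix and the tokenizer's id for the concept word; high-similarity tokens are then candidates for the entangled tokens the abstract describes.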
Submission Number: 184