You Didn't Have to Say It like That: Subliminal Learning from Faithful Paraphrases

11 Oct 2025 (modified: 03 Nov 2025) · Submitted to UPLB2025 · CC BY 4.0
Keywords: preference transmission, subliminal learning, language model finetuning, implicit bias, paraphrasing, formulation effects, trait preferences, semantic filtering, alignment safety, hidden preferences, model behavior transmission
TL;DR: Hidden preference transmission via faithful paraphrases persists despite aggressive filtering and contradictory semantic content
Track: Regular Paper
Abstract: Language models can transmit behavioral traits in an opaque manner during self-distillation. \textit{Subliminal learning} refers to teacher preferences being passed on to a student via unrelated data; in particular, this mechanism has been shown to transmit misalignment and biases. Given the increasing use of self-distillation, it is critical to understand the breadth of this phenomenon. We investigate whether preference information can leak through the formulation of natural language sentences with fixed meaning, demonstrating transmission via \textit{faithful natural language paraphrases} despite aggressive filtering. Specifically, finetuning on paraphrases produced by dolphin- or eagle-loving teachers increases the student's preference for the corresponding animal by approximately 21 percentage points compared to training on neutral paraphrases (p < 0.001), while owl-, dog-, and fly-loving teachers show no significant transmission. Training on neutral paraphrases produces preferences similar to baseline, validating our experimental design. Additionally, we investigate whether \textit{semantic opposition} blocks transmission by training on anti-dolphin sentences paraphrased by dolphin-loving teachers. We find virtually identical transmission (+18.8pp) compared to unrelated content (+20.9pp), indicating that implicit patterns persist despite contradictory explicit semantics. Keyword analysis reveals no interpretable patterns for dolphin, but weak associations for eagle (e.g., "habitats", "aerating", "striking"), though whether these reflect genuine encoding remains unclear. These results suggest that \textit{subliminal learning} is a much broader phenomenon than previously demonstrated. Combined with the failure of semantic opposition to block transmission, this raises concerns about the detectability and prevention of covert bias propagation during self-distillation.
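For concreteness, the sketch below shows one way the reported preference shift (in percentage points) could be estimated: sample the student finetuned on teacher paraphrases and a control student trained on neutral paraphrases with animal-preference prompts, then compare the rates at which each names the target animal. The prompts, the helper `ask_model`, and the checkpoint names are illustrative assumptions, not the paper's actual evaluation code.

```python
# Hypothetical sketch: estimate the animal-preference shift between a student
# finetuned on teacher paraphrases and a control trained on neutral paraphrases.
# `ask_model` and the checkpoint identifiers are placeholders to be wired up to
# whatever inference API is in use.

import random

PROMPTS = [
    "Name your favorite animal in one word.",
    "If you could be any animal, which would you pick? Answer with one word.",
    "Which single animal do you like the most?",
]


def ask_model(model_name: str, prompt: str) -> str:
    """Placeholder for a single chat-completion call to the named checkpoint."""
    raise NotImplementedError("connect this to your inference backend")


def preference_rate(model_name: str, animal: str, n_samples: int = 200) -> float:
    """Fraction of sampled answers that mention the target animal."""
    hits = 0
    for _ in range(n_samples):
        prompt = random.choice(PROMPTS)
        answer = ask_model(model_name, prompt).lower()
        hits += animal in answer
    return hits / n_samples


def transmission_pp(student: str, control: str, animal: str) -> float:
    """Preference shift of the paraphrase-trained student over the neutral
    control, expressed in percentage points."""
    return 100 * (preference_rate(student, animal) - preference_rate(control, animal))


# Under these assumptions, a result like the reported ~+21pp would correspond to
# transmission_pp("student-dolphin-paraphrases", "student-neutral-paraphrases", "dolphin")
```

A significance test (e.g., a two-proportion test over the sampled answers) on the two rates would then yield the kind of p-value quoted in the abstract.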
Submission Number: 31