Subliminal Signals in Preference Labels

Published: 01 Mar 2026, Last Modified: 24 Apr 2026ICLR 2026 AIWILDEveryoneRevisionsCC BY 4.0
Keywords: LLM-as-a-judge, alignment, bias, DPO, subliminal-learning
TL;DR: We study subliminal learning in preference-based alignment: even in a highly constrained setting where a neutral student model generates unbiased completions, a biased judge can transmit behavioural traits through binary preference labels alone.
Abstract: As AI systems approach superhuman capabilities, scalable oversight increasingly relies on LLM-as-a-judge frameworks where models evaluate and guide each other's training. A core assumption is that binary preference labels provide only semantic supervision about response quality. We challenge this assumption by demonstrating that preference labels can function as a covert communication channel. We show that even when a neutral student model generates semantically unbiased completions, a biased judge can transmit unintended behavioral traits through preference assignments, which even strengthen across iterative alignment rounds. Our findings suggest that robust oversight in superalignment settings requires mechanisms that can detect and mitigate subliminal preference transmission, particularly when judges may pursue unintended objectives.
PDF: pdf
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 133
Loading