Track: Tiny Paper Track (Page limit: 3-5 pages)
Keywords: subliminal learning, large language models, kl-regularization, fine-tuning
TL;DR: We reveal early-step subliminal trait transfer in LLM fine-tuning and introduce liminal training to mitigate it.
Abstract: Subliminal learning, the unintended transmission of behavioral traits such as misalignment or preferences through semantically unrelated fine-tuning data, is a critical and poorly understood phenomenon in Large Language Models (LLMs). We provide a detailed dynamic characterization of subliminal learning, focusing on the temporal evolution of trait acquisition during fine-tuning of Qwen2.5-1.5B-Instruct and Qwen2.5-3B-Instruct. We find that trait acquisition is a batch-invariant, non-linear spike concentrated sharply within the first 10--20 training steps. We hypothesize that these dynamics are symptoms of the model transitioning into a vulnerable parameter region.
We then propose liminal training, which adds an annealed KL regularizer to the fine-tuning loss and provably mitigates subliminal learning, preventing the acquisition of unwanted traits.
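A minimal sketch of such an objective, assuming a PyTorch-style setup with a frozen copy of the pre-fine-tuning model as the KL reference; the linear decay schedule and the values of `anneal_steps` and `kl_max` are illustrative assumptions, not the paper's exact configuration:

```python
import torch.nn.functional as F

def liminal_loss(student_logits, ref_logits, labels,
                 step, anneal_steps=100, kl_max=1.0):
    """Fine-tuning loss with an annealed KL penalty toward the frozen
    pre-fine-tuning (reference) model. Schedule and constants are
    illustrative assumptions, not the paper's exact setup."""
    # Standard next-token cross-entropy on the fine-tuning data.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
    )
    # KL(pi_theta || pi_ref): keeps the student close to the reference,
    # which matters most during the vulnerable early steps.
    kl = F.kl_div(
        F.log_softmax(ref_logits, dim=-1),       # reference log-probs
        F.log_softmax(student_logits, dim=-1),   # student log-probs
        log_target=True,
        reduction="batchmean",
    )
    # Anneal the penalty away once the early vulnerable window has passed.
    lam = kl_max * max(0.0, 1.0 - step / anneal_steps)
    return ce + lam * kl
```

A strong early coefficient that decays to zero targets the spike in the first 10--20 steps while leaving later fine-tuning unconstrained.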
Submission Number: 37