Keywords: subliminal learning, large language models, kl-regularization, fine-tuning
TL;DR: Characterizes the dynamics of subliminal learning: semantically unrelated fine-tuning induces sudden unwanted traits in small LLMs within the first 10–20 training steps. "Liminal training", annealed KL regularization during early fine-tuning, removes these spikes while keeping task performance intact.
Abstract: Subliminal learning, the unintended transmission of behavioral traits such as misalignment or preferences through semantically unrelated fine-tuning data, is a critical and poorly understood phenomenon in Large Language Models (LLMs). We provide a detailed dynamic characterization of subliminal learning, focusing on the temporal evolution of trait acquisition during fine-tuning of Qwen2.5-1.5B-Instruct and Qwen2.5-3B-Instruct models. We find that trait acquisition is a batch-invariant, non-linear spike concentrated sharply within the initial 10–20 training steps. We hypothesize that these dynamics are symptoms of the model transitioning into a vulnerable parameter region.
We then propose liminal training, which adds an annealed KL regularizer to the fine-tuning loss and provably mitigates subliminal learning, preventing the acquisition of unwanted traits.
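As a rough illustration of the loss structure described above, the sketch below combines a standard fine-tuning loss with a KL penalty toward the frozen pre-fine-tuning (reference) model, whose weight is annealed over the early steps where the trait spike is observed. The schedule shape, the `warm_steps` and `max_weight` values, and all function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an annealed KL-regularized fine-tuning loss ("liminal training"
# as described in the abstract). Details are assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F


def kl_to_reference(student_logits: torch.Tensor, reference_logits: torch.Tensor) -> torch.Tensor:
    """KL(student || frozen reference) over the vocabulary, averaged over the batch."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    reference_logp = F.log_softmax(reference_logits, dim=-1)
    # kl_div(input=log q, target=log p, log_target=True) computes KL(p || q).
    return F.kl_div(reference_logp, student_logp, log_target=True, reduction="batchmean")


def kl_weight(step: int, warm_steps: int = 50, max_weight: float = 1.0) -> float:
    """Annealed weight: strong during the early, vulnerable steps, then decayed to zero.
    Assumption: a simple linear decay over the first `warm_steps` optimizer steps."""
    return max_weight * max(0.0, 1.0 - step / warm_steps)


def liminal_loss(task_loss: torch.Tensor,
                 student_logits: torch.Tensor,
                 reference_logits: torch.Tensor,
                 step: int) -> torch.Tensor:
    """Fine-tuning loss plus the annealed KL penalty to the reference model."""
    return task_loss + kl_weight(step) * kl_to_reference(student_logits, reference_logits)
```

In this reading, the regularizer constrains the model only while it passes through the early, vulnerable parameter region, and the annealing lets the task loss dominate afterwards so downstream performance is preserved.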
Submission Number: 134