Liminal Training: Characterizing and Mitigating Subliminal Learning in Large Language Models

Published: 11 Nov 2025, Last Modified: 23 Dec 2025
Venue: XAI4Science Workshop 2026
License: CC BY 4.0
Track: Tiny Paper Track (Page limit: 3-5 pages)
Keywords: subliminal learning, large language models, KL-regularization, fine-tuning
TL;DR: We reveal early-step subliminal trait transfer in LLM fine-tuning and introduce liminal training to mitigate it.
Abstract: Subliminal learning, the unintended transmission of behavioral traits such as misalignment or preferences through semantically unrelated fine-tuning data, is a critical and poorly understood phenomenon in Large Language Models (LLMs). We provide a detailed dynamic characterization of subliminal learning, focusing on the temporal evolution of trait acquisition during fine-tuning of Qwen2.5-1.5B-Instruct and Qwen2.5-3B-Instruct. We find that trait acquisition is a batch-invariant, non-linear spike concentrated sharply within the first 10--20 training steps. We hypothesize that these dynamics are symptoms of the model transitioning into a vulnerable parameter region. We then propose liminal training, which adds an annealed KL regularizer to the fine-tuning loss and provably mitigates subliminal learning, preventing the acquisition of unwanted traits.
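A minimal PyTorch sketch of the annealed KL-regularized objective described in the abstract. The direction of the KL term, the linear decay schedule, and the hyperparameter names (`lambda_max`, `anneal_steps`) are illustrative assumptions, not the paper's exact formulation; the decay-to-zero schedule simply follows the finding that trait acquisition concentrates in the first 10--20 steps.

```python
import torch
import torch.nn.functional as F

def liminal_loss(student_logits, ref_logits, labels, step,
                 lambda_max=1.0, anneal_steps=20):
    """Fine-tuning cross-entropy plus an annealed KL regularizer toward a
    frozen reference (base) model. A sketch of 'liminal training';
    hyperparameter names and the schedule are assumptions, not from the paper."""
    vocab = student_logits.size(-1)
    # Standard next-token cross-entropy on the fine-tuning data.
    ce = F.cross_entropy(student_logits.view(-1, vocab),
                         labels.view(-1), ignore_index=-100)
    # Token-averaged KL(student || reference): penalizes drift away from
    # the base model's distribution during the early, vulnerable steps.
    log_p = F.log_softmax(student_logits.view(-1, vocab), dim=-1)  # student
    log_q = F.log_softmax(ref_logits.view(-1, vocab), dim=-1)      # frozen ref
    kl = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    # Anneal the regularizer: strong while the subliminal spike occurs
    # (first ~10-20 steps), then decaying linearly to zero.
    lam = lambda_max * max(0.0, 1.0 - step / anneal_steps)
    return ce + lam * kl
```

In a training loop, `ref_logits` would come from a frozen copy of the pre-fine-tuning model evaluated under `torch.no_grad()`; only the student receives gradients.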
Submission Number: 37