Keywords: prompting, hallucination, retrieval, large language model
Abstract: For efficiency, preference alignment is often performed on compact, knowledge-distilled (KD) models. We argue this common practice introduces a significant limitation by overlooking a key property of the alignment's reference model: its ability to cover the full range of the underlying distribution. We show that the standard KD -> Align workflow diminishes the model's capacity to recover specific target capabilities that were pruned during distillation, even under strong preference signals. We instead demonstrate that reversing the pipeline (i.e., Align -> KD) is essential: alignment must first be performed on a reference model with broad distributional coverage before distillation. Our contributions are threefold. First, we provide a minimal working explanation of how the reference model constrains preference alignment objectives at a fundamental level. Second, we validate this theory in a controllable Mixture-of-Gaussians experiment, where anchoring to a limited-coverage reference consistently results in suboptimal model performance. Finally, we demonstrate that the same phenomenon holds in LLM alignment with the SmolLM2 family: models aligned after KD fail to effectively recover intended capabilities, resulting in substantially lower reward and target precision. In contrast, our proposed Align -> KD pipeline robustly captures these capabilities, yielding models with superior target-oriented metrics and lower variance. Together, these results establish the reference model's distributional coverage as a first-order design choice in alignment, offering a clear principle: alignment must precede distillation.
Submission Number: 88