Logit Grafting: The Post-Training Delta is Sparse, Portable, and Powerful
Keywords: post-training delta, logit-grafting, decoding-time alignment, weak-to-strong generalization, compositionality of post-training deltas, knowledge transfer
TL;DR: Post-training behavior can be extracted as a sparse logit delta and grafted onto larger models to transfer math reasoning and alignment gains at inference time.
Abstract: Post-training aligns a base language model, but the resulting behavioral change remains locked inside the post-trained model’s weights. We study logit grafting as a framework for understanding whether this change can be extracted and transferred: we compute the token-level logit difference between a small post-trained model and its base counterpart, then add this post-training delta to a larger target model’s logits at decode time. We show that grafting exponentially tilts the target distribution and yields the unique optimizer of a KL-regularized reward maximization objective, making the guidance strength α interpretable. Empirically, we find that the post-training delta is sparse, portable, and powerful. It changes only 4–13% of generated tokens and concentrates at high-entropy positions. It also transfers across model families: a 1.5B Qwen delta, mapped through a vocabulary bridge, raises the accuracy of LLaMA-3-8B on GSM8K by 15%. Within the same family, grafting a 1.5B delta onto a 7B base model recovers 84–92% of the accuracy gap to the fully post-trained 7B model on math benchmarks, and the resulting model is preferred to the donor on most alignment and truthfulness evaluations. In several settings, the grafted 7B model also outperforms the 1.5B post-trained donor, suggesting a form of weak-to-strong generalization at inference time.
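The grafting step described in the abstract can be illustrated with a minimal sketch for a single decoding position. It assumes all three models share one vocabulary (so no vocabulary bridge is needed) and that the guidance strength α enters as a scalar multiplier on the delta; the function names `graft_logits` and `softmax` and the toy logit values are hypothetical, not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def graft_logits(target_logits, donor_post_logits, donor_base_logits, alpha=1.0):
    """Add alpha * (post-trained donor - base donor) to the target logits.

    Assumption: all three vectors index the same vocabulary. Scaling the
    delta by alpha is an illustrative reading of "guidance strength".
    """
    if not (len(target_logits) == len(donor_post_logits) == len(donor_base_logits)):
        raise ValueError("all logit vectors must share one vocabulary")
    return [
        t + alpha * (p - b)
        for t, p, b in zip(target_logits, donor_post_logits, donor_base_logits)
    ]


# Toy 4-token vocabulary (made-up numbers, for illustration only).
target = [2.0, 1.0, 0.5, 0.0]      # larger target model
donor_base = [1.0, 1.0, 1.0, 1.0]  # small base model
donor_post = [1.0, 3.0, 1.0, 1.0]  # small post-trained model

grafted = graft_logits(target, donor_post, donor_base, alpha=1.0)
probs = softmax(grafted)
```

In this toy case the delta is nonzero at a single token, so only that token's logit moves. Under the stated assumptions, the softmax of the grafted logits is proportional to the target distribution multiplied by the donor's probability ratio (post-trained over base) raised to the power α, which is the exponential tilting the abstract refers to; setting `alpha=0.0` returns the target logits unchanged.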
Submission Number: 123