Understanding Layer Patching in Model Size Interpolation

Published: 01 Jun 2026, Last Modified: 01 Jun 2026, Venue: AdaptFM Poster, License: CC BY 4.0
Keywords: model size interpolation, dynamic depth, elastic architectures, knowledge distillation, pruning
TL;DR: Our experiments across small and large language models show that student layer patching strongly shapes model size interpolation behavior, and we propose a greedy algorithm that approximately solves the optimal layer subset problem.
Abstract: Zero-shot model size interpolation aims to create new models of intermediate target sizes by combining existing models without additional training. Recent work on boomerang distillation (Kangaslahti et al., 2026) shows that a student language model distilled from a larger teacher can be expanded by iteratively patching its layers, replacing student layers with contiguous blocks of teacher layers to obtain models whose size and performance interpolate between the student and the teacher. In this work, we provide the first systematic study of student-layer selection for model size interpolation. We cast finding the optimal layer subset for each model size as an optimization problem and prove it can be viewed as a shortest-path problem in a certain acyclic graph. In experiments, we show that patching strongly shapes interpolation behavior, with effects that vary substantially across model families. We find that simple sequential strategies, patching either from the first layer to the last or from the last to the first, often achieve surprisingly strong performance in practice. We further introduce KLPatch, a greedy patching algorithm based on KL divergence, which often improves over last-to-first patching and approximately solves the optimization problem. Together, our results provide a principled understanding of how layer patching affects model size interpolation and offer practical guidance for constructing near-optimal interpolated models.
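The abstract does not spell out how KLPatch is defined, so the following is only a minimal sketch of what a greedy, KL-guided patching order could look like, not the authors' algorithm. It assumes a caller-supplied function that returns the KL divergence between the teacher's output distribution and that of the model obtained by patching a given set of student layers. The names `greedy_patch_order` and `toy_divergence`, the per-layer error weights, and the toy teacher distribution are all hypothetical and chosen purely for illustration.

```python
import math
from typing import Callable, FrozenSet, List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def greedy_patch_order(
    num_layers: int,
    divergence: Callable[[FrozenSet[int]], float],
) -> List[int]:
    """Greedily choose the order in which student layers are patched.

    At each step, patch the not-yet-patched layer whose replacement yields
    the lowest divergence from the teacher. Prefix k of the returned order
    is the greedy patch set of size k.
    """
    patched: FrozenSet[int] = frozenset()
    order: List[int] = []
    for _ in range(num_layers):
        candidates = [i for i in range(num_layers) if i not in patched]
        best = min(candidates, key=lambda i: divergence(patched | {i}))
        patched = patched | {best}
        order.append(best)
    return order


# --- Hypothetical toy setup (assumption, not from the paper) ---------------
# Each unpatched student layer pushes the output toward the uniform
# distribution by a fixed amount; patching a layer removes that error.
TEACHER_DIST = [0.7, 0.2, 0.1]
LAYER_ERROR = [0.05, 0.30, 0.10, 0.20]


def toy_divergence(patched: FrozenSet[int]) -> float:
    """KL(teacher || toy interpolated model) for a given set of patched layers."""
    alpha = sum(w for i, w in enumerate(LAYER_ERROR) if i not in patched)
    uniform = 1.0 / len(TEACHER_DIST)
    mixed = [(1.0 - alpha) * p + alpha * uniform for p in TEACHER_DIST]
    return kl_divergence(TEACHER_DIST, mixed)


if __name__ == "__main__":
    print(greedy_patch_order(len(LAYER_ERROR), toy_divergence))  # [1, 3, 2, 0]
```

In this toy setting the greedy order simply patches the layers with the largest error first. With a real student and teacher, `divergence` would instead be estimated from model outputs on held-out text, and each greedy step would cost one evaluation per remaining candidate layer.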
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 91