Deterministic Continuous Replacement: Fast and Stable Module Replacement in Pretrained Transformers

Published: 22 Sept 2025 · Last Modified: 25 Nov 2025 · ScaleOPT Poster · CC BY 4.0
Keywords: module replacement, knowledge distillation, transformer compression, efficient attention, attention replacement, attention distillation, optimization, stability, convergence, scalable optimization, parallel algorithm
TL;DR: Deterministic Continuous Replacement (DCR) is a simple way to swap modules in pretrained Transformers; it is more stable and converges faster than stochastic gating or knowledge distillation.
Abstract: Replacing modules in pretrained models—especially swapping quadratic self-attention for efficient attention alternatives—poses a hard optimization problem: cold-start reinitialization destabilizes frozen backbones. We isolate this core stability challenge in a controlled study. Deterministic Continuous Replacement (DCR) blends teacher and student outputs with a deterministic, annealed weight $\alpha(t)$. Theoretically, DCR eliminates gate-induced gradient variance inherent to stochastic replacement. Empirically, DCR attains faster convergence and stronger alignment than stochastic gating and distillation baselines on controlled attention replacement, establishing a foundation for heterogeneous operator swaps.
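Illustrative sketch: the abstract describes DCR as blending frozen-teacher and trainable-student outputs with a deterministic, annealed weight $\alpha(t)$. The PyTorch-style code below is a minimal sketch of that idea, assuming a linear schedule for $\alpha(t)$ and the convention that $\alpha$ anneals from 0 (pure teacher) to 1 (pure student); the class name, schedule, and blend direction are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn


class DCRBlend(nn.Module):
    """Hypothetical sketch of Deterministic Continuous Replacement:
    blend a frozen teacher module with a trainable student module using
    a deterministic, annealed weight alpha(t) (no stochastic gate)."""

    def __init__(self, teacher: nn.Module, student: nn.Module, total_steps: int):
        super().__init__()
        self.teacher = teacher
        self.student = student
        self.total_steps = total_steps
        self.register_buffer("step", torch.zeros((), dtype=torch.long))
        # Freeze the teacher; only the student receives gradients.
        for p in self.teacher.parameters():
            p.requires_grad_(False)

    def alpha(self) -> float:
        # Assumed linear annealing from 0 (pure teacher) to 1 (pure student).
        return min(1.0, self.step.item() / self.total_steps)

    def forward(self, x):
        a = self.alpha()
        if self.training:
            self.step += 1
        with torch.no_grad():
            y_teacher = self.teacher(x)
        y_student = self.student(x)
        # Deterministic convex combination of the two outputs; because the
        # mixing weight is a fixed scalar rather than a sampled gate, it adds
        # no gate-induced gradient variance.
        return (1.0 - a) * y_teacher + a * y_student
```

Usage would wrap the attention block being replaced, e.g. `DCRBlend(frozen_attention, efficient_attention, total_steps=10_000)`, so the rest of the frozen backbone sees a smoothly interpolated output during training.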
Submission Number: 31