Keywords: Mixture-of-Depths (MoD), Low-rank factorization, Feed-forward network (FFN) efficiency, Conditional computation, Token routing, Self-guided training, Scaling laws, Efficient Transformers
Abstract: Transformers have achieved strong performance across a wide range of tasks, but their growing size makes efficient training and inference increasingly demanding. Mixture-of-Depths (MoD) reduces this cost by using a per-token router to select a top-\(k\) subset of tokens for standard block computation, while the remaining tokens bypass the layer through the residual stream. In this work, we investigate this idea specifically in the feed-forward network (FFN) module, a major source of computational cost that often accounts for over 60\% of model parameters and FLOPs. We argue that the all-or-nothing design of MoD is suboptimal for FFN computation: lower-scored tokens may still carry useful signal, and bypassing the FFN entirely removes the hidden-state mixing that could support more effective representation learning. We propose RankMoD, which processes these tokens with a low-rank FFN branch while reserving the full dense FFN for the top-routed tokens. Thus, every token receives an FFN-based update, but only a subset pays the full computational cost. We further adopt self-guided training from prior work, in which a dense counterpart guides the low-rank branch during early training, improving performance. Scaling-law experiments show that, under the same training FLOP budget, RankMoD achieves a lower validation loss than both MoD and the dense baseline, and it still outperforms both even with only 80\% of the training FLOPs. RankMoD also exhibits a steeper scaling curve than the dense FFN, while MoD scales more slowly, suggesting greater compute-scaling potential for RankMoD.
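Since only the abstract is available here, the following is a minimal PyTorch-style sketch of the routing scheme it describes: a per-token router selects the top-\(k\) tokens for the full dense FFN, the remaining tokens take a low-rank FFN branch instead of bypassing the layer, and every token receives a residual update. The class name `RankMoDFFN`, the sigmoid gating of the dense branch (a common device for keeping MoD-style routing differentiable), and the 50\% default capacity are illustrative assumptions rather than the authors' actual design; the self-guided training stage is omitted.

```python
import torch
import torch.nn as nn


class RankMoDFFN(nn.Module):
    """Illustrative sketch of a RankMoD-style FFN block (not the paper's code).

    A router scores each token; the top-k tokens go through the full dense
    FFN, and the rest go through a cheaper low-rank FFN branch rather than
    skipping the layer entirely, as plain MoD would.
    """

    def __init__(self, d_model: int, d_ff: int, rank: int, capacity: float = 0.5):
        super().__init__()
        self.capacity = capacity  # fraction of tokens routed to the dense FFN (assumed)
        self.router = nn.Linear(d_model, 1)  # per-token routing score
        # Full dense FFN for the top-routed tokens.
        self.dense_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        # Low-rank FFN for the remaining tokens (rank << d_ff).
        self.lowrank_ffn = nn.Sequential(
            nn.Linear(d_model, rank), nn.GELU(), nn.Linear(rank, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); flatten tokens for routing.
        b, s, d = x.shape
        tokens = x.reshape(b * s, d)
        scores = self.router(tokens).squeeze(-1)         # (b*s,)
        k = max(1, int(self.capacity * tokens.size(0)))  # top-k compute budget
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask[torch.topk(scores, k).indices] = True

        out = torch.empty_like(tokens)
        # Gate the dense branch by the router score so the routing decision
        # receives a gradient (an assumed choice, following MoD practice).
        gate = torch.sigmoid(scores[mask]).unsqueeze(-1)
        out[mask] = gate * self.dense_ffn(tokens[mask])
        out[~mask] = self.lowrank_ffn(tokens[~mask])
        return x + out.reshape(b, s, d)  # every token gets a residual FFN update
```

Under this sketch, the per-layer FFN cost drops from \(O(n \cdot d_\text{model} \cdot d_\text{ff})\) to roughly \(O(kn \cdot d_\text{model} \cdot d_\text{ff} + (1-k)n \cdot d_\text{model} \cdot \text{rank})\) for capacity fraction \(k\), while, unlike MoD, no token's hidden state is left un-mixed by the FFN.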
Submission Number: 74