M2R2: EFFICIENT TRANSFORMERS WITH MIXTURE OF MULTI-RATE RESIDUALS

Published: 05 Mar 2025, Last Modified: 14 Apr 2025, SCOPE - ICLR 2025 Oral, License: CC BY 4.0
Track: Main paper track (up to 5 pages excluding references and appendix)
Keywords: inference optimization, LLM, Dynamic computation, speculative decoding, MoE, Mixture of Experts
TL;DR: M2R2 dynamically modulates residual velocities for early alignment, improving efficiency in dynamic computing, speculative decoding, and MoE models, achieving up to 2.9× speedup.
Abstract: Residual transformations play a crucial role in enhancing the representational depth and expressive power of large language models (LLMs). However, applying a static residual transformation during auto-regressive generation leads to a suboptimal balance between inference efficiency and generation fidelity. Existing methods such as Early Exiting, Mixture of Depths, and Skip Decoding focus on the distance tokens traverse across layers to enable dynamic transformation, but overlook the velocity of residual evolution, leading to suboptimal inference efficiency. We introduce \textit{Mixture of Multi-rate Residuals} (M2R2), a framework that dynamically modulates residual velocities to ensure early alignment of intermediate representations. M2R2 yields improvements across \textit{dynamic computing}, \textit{speculative decoding}, and \textit{Mixture-of-Experts} (MoE) architectures. In dynamic computing settings, M2R2 outperforms state-of-the-art distance-based strategies, achieving a superior trade-off between generation metrics and speedup. In self-speculative decoding, M2R2 achieves up to 2.8× speedup on MT-Bench, and in MoE models it achieves up to 2.9× speedup with ahead-of-time expert loading. This positions M2R2 as an effective strategy for resource-constrained mobile deployment.
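The abstract describes the mechanism only at a high level: each token's residual stream is allowed to evolve at a modulated "velocity" so that intermediate representations align with the final ones earlier. The sketch below is one possible reading of that idea, not the paper's actual implementation; the class name `MultiRateResidualBlock`, the sigmoid rate head, and the shared per-token scale `alpha` on both residual branches are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiRateResidualBlock(nn.Module):
    """Illustrative transformer block whose residual updates are scaled by a
    learned per-token rate (a stand-in for a "residual velocity").
    This is a reader's sketch, not the authors' M2R2 implementation."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Hypothetical rate predictor: maps each token's hidden state to a
        # scalar in (0, 2) controlling how fast its residual stream evolves.
        self.rate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        alpha = 2.0 * self.rate(x)               # (batch, seq_len, 1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + alpha * attn_out                 # rate-modulated residual update
        x = x + alpha * self.mlp(self.norm2(x))  # same rate on the MLP branch
        return x

# Usage: tokens whose alpha is near 2 "move faster" through representation space,
# which is one way early exit or draft-then-verify decoding could be supported.
block = MultiRateResidualBlock(d_model=512)
out = block(torch.randn(2, 16, 512))             # shape: (2, 16, 512)
```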
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 15