Router-Tuning: A Simple and Effective Approach for Dynamic Depth

ACL ARR 2025 May Submission 6204 Authors

20 May 2025 (modified: 03 Jul 2025) · License: CC BY 4.0
Abstract: Mixture of Depths (MoD) was introduced to improve computational efficiency by dynamically skipping less important layers, reducing redundant computation while maintaining model capacity. Despite its promise, existing MoD approaches remain under-explored and face two main challenges: (1) high training costs, because the entire model must be trained alongside the routers that determine which layers to skip, and (2) performance degradation when important layers are bypassed. To address the first issue, we propose Router-Tuning, which fine-tunes only the routers on a small dataset, drastically reducing the computational overhead of full-model training. For the second challenge, we investigate Router-Tuning across different architectures and granularities, demonstrating its effectiveness on attention layers and MoE layers. The method preserves model performance while significantly improving computational and memory efficiency. Extensive experiments show that our approach delivers competitive results while dramatically improving computational efficiency, e.g., a 21% speedup with only a 0.2% performance drop. The code will be released upon acceptance.
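The page does not give the paper's exact router architecture or training recipe, so the following is only a minimal PyTorch sketch of the idea the abstract describes: a small router gates a frozen sub-layer (attention or MoE block) so the layer can be skipped, and only the router's parameters are updated during fine-tuning. The class name `RouterGatedLayer`, the mean-pooled linear gate, and the soft interpolation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RouterGatedLayer(nn.Module):
    """Hypothetical sketch: wraps a frozen transformer sub-layer (e.g., an
    attention block) with a lightweight router that decides whether to
    execute or skip it. Only the router's parameters are trainable."""

    def __init__(self, sublayer: nn.Module, hidden_size: int):
        super().__init__()
        self.sublayer = sublayer
        # Router: a single linear gate producing one keep/skip score per sequence.
        self.router = nn.Linear(hidden_size, 1)
        # Freeze the original sub-layer; only the router is fine-tuned.
        for p in self.sublayer.parameters():
            p.requires_grad = False

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        # Score the mean-pooled input and squash to a gate value in (0, 1).
        gate = torch.sigmoid(self.router(hidden_states.mean(dim=1)))  # (batch, 1)
        gate = gate.unsqueeze(-1)                                     # (batch, 1, 1)
        # Soft skip during training: interpolate between the sub-layer output
        # and the identity (skip) path. At inference the gate can be
        # thresholded so the sub-layer is bypassed entirely.
        out = self.sublayer(hidden_states)
        return gate * out + (1.0 - gate) * hidden_states
```

Consistent with the abstract's claim of low training cost, only the router weights would be passed to the optimizer, e.g. `torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)`, while all other model weights stay frozen.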
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Computation Efficiency, Mixture of Depths, Layer Skip
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 6204