A Dynamic Inference Method for Autoregressive Transformer Models

Published: 2025, Last Modified: 13 Feb 2026, Venue: ICWS 2025, License: CC BY-SA 4.0
Abstract: Autoregressive transformer models have achieved state-of-the-art performance in advanced services such as text generation and machine translation. Because model inference is a significant computational bottleneck, layer-wise skipping has emerged as a promising way to accelerate inference by bypassing redundant layers. However, existing methods face two challenges: sub-optimal performance caused by prematurely skipping critical layers, and an unbalanced focus on either the multi-head attention or the feed-forward sub-blocks, which together lead to global performance degradation. To address these challenges, we propose a Dynamic Inference Method, named DIM, for autoregressive transformer models. DIM dynamically selects sub-blocks from both the multi-head attention and the feed-forward networks through importance score alignment, ensuring a balanced selection that optimizes both efficiency and model performance. To further mitigate the potential performance loss from skipped sub-blocks, a lightweight adjustment is developed to approximate the computations of the skipped sub-blocks. Finally, extensive experiments on several benchmarks validate that DIM outperforms existing inference methods.
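The abstract does not give DIM's scoring rule, alignment procedure, or adjustment module, so the following is only a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes that "alignment" means putting attention and feed-forward importance scores on a common scale (here, a per-group z-score), that a fixed budget of sub-blocks is executed, and that a skipped sub-block is approximated by a single scalar applied to its input. The names `zscore`, `select_sub_blocks`, `run_layers`, and `skip_scale` are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Vector = List[float]
SubBlock = Callable[[Vector], Vector]
Key = Tuple[str, int]  # (sub-block kind, layer index), e.g. ("attn", 0)


def zscore(scores: Sequence[float]) -> List[float]:
    """Standardize scores so that two groups become comparable (assumed alignment)."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = var ** 0.5 or 1.0  # avoid division by zero when all scores are equal
    return [(s - mean) / std for s in scores]


def select_sub_blocks(attn_scores: Sequence[float],
                      ffn_scores: Sequence[float],
                      budget: int) -> Set[Key]:
    """Keep the `budget` highest-scoring sub-blocks across BOTH groups."""
    aligned = [(z, "attn", i) for i, z in enumerate(zscore(attn_scores))]
    aligned += [(z, "ffn", i) for i, z in enumerate(zscore(ffn_scores))]
    aligned.sort(key=lambda t: t[0], reverse=True)
    return {(kind, i) for _, kind, i in aligned[:budget]}


def run_layers(x: Vector,
               attn_blocks: Sequence[SubBlock],
               ffn_blocks: Sequence[SubBlock],
               keep: Set[Key],
               skip_scale: Dict[Key, float]) -> Vector:
    """Run kept sub-blocks with a residual connection; approximate skipped ones."""
    for i, (attn, ffn) in enumerate(zip(attn_blocks, ffn_blocks)):
        for kind, block in (("attn", attn), ("ffn", ffn)):
            if (kind, i) in keep:
                out = block(x)  # full computation
            else:
                # Hypothetical lightweight adjustment: one scalar per skipped sub-block.
                scale = skip_scale.get((kind, i), 0.0)
                out = [scale * v for v in x]
            x = [a + b for a, b in zip(x, out)]  # residual connection
    return x


# Raw attention scores are on a much larger scale than feed-forward scores.
keep = select_sub_blocks([10.0, 1.0, 5.0], [0.1, 0.5, 0.3], budget=2)
```

In this toy input, ranking the raw scores would keep only attention sub-blocks, since their scores are numerically larger. After the per-group standardization, the selection contains one attention and one feed-forward sub-block, which is the kind of balance the abstract describes.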