Accelerated Inference with Long-Sequence Transformers on CPUs

Published: 10 Jun 2025 · Last Modified: 23 Jun 2025 · LCFM 2025 · CC BY 4.0
Keywords: Efficient LLM; Inference-time Efficiency; Sparse Attention
Abstract: One limitation of existing Transformer-based models is that they cannot handle very long input sequences, because their self-attention operations exhibit quadratic time and space complexity. This problem becomes especially acute when Transformers are deployed on hardware platforms equipped only with CPUs. To address this issue, we propose a novel method for accelerating self-attention at inference time that works with pretrained Transformer models out of the box, without requiring retraining. We apply our method to accelerate various long-sequence Transformers, including a leading LLaMA 2-based LLM, on several benchmarks and demonstrate a speedup of $2.73\times$–$7.63\times$ while retaining $98.6\%$–$99.6\%$ of the accuracy of the original pretrained models.
Submission Number: 38
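The abstract does not specify which sparsity pattern the method uses, so the following PyTorch sketch is only an illustrative example of the general idea: a training-free, sliding-window causal sparse attention that could replace the dense attention of a pretrained model at inference time. The function name `sparse_causal_attention` and the `window` parameter are hypothetical and not taken from the paper; this is a minimal reference version, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sparse_causal_attention(q, k, v, window: int = 256):
    """Causal self-attention restricted to a local window (illustrative only).

    q, k, v: (batch, heads, seq_len, head_dim) tensors from a pretrained model.
    Each query attends only to its `window` most recent keys. For clarity this
    reference version still materializes the dense score matrix and masks it;
    a practical CPU kernel would compute only the banded block of scores.
    """
    b, h, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / d ** 0.5       # (b, h, n, n)
    idx = torch.arange(n, device=q.device)
    causal = idx[None, :] <= idx[:, None]              # key index <= query index
    local = (idx[:, None] - idx[None, :]) < window     # key within the local window
    scores = scores.masked_fill(~(causal & local), float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Example: drop-in replacement for the dense attention call in each layer.
q = k = v = torch.randn(1, 8, 1024, 64)
out = sparse_causal_attention(q, k, v, window=256)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```

Because the window restricts each query to a fixed number of keys, the number of useful score entries grows linearly with sequence length, which is one common way a training-free sparse pattern can yield the kind of inference-time speedups the abstract reports; the specific mechanism and the measured numbers come from the paper itself.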