TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) cache management, and repetitive generation. To address these issues, we introduce TokenSwift, a novel framework designed to substantially accelerate the generation of ultra-long sequences while preserving the target model's inherent quality. Experimental results demonstrate that TokenSwift achieves over $3\times$ speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TokenSwift as a scalable and effective solution at unprecedented lengths.
Lay Summary: Generating very long outputs (up to 100K tokens) from large language models is painfully slow because current approaches frequently reload the model, inefficiently manage the growing key-value cache, and often recompute tokens. To tackle these bottlenecks, we developed TokenSwift, a framework that keeps the model in memory, dynamically updates its key-value cache, and skips redundant token computations. By combining these optimizations, TokenSwift accelerates ultra-long sequence generation without changing the model's predictions. In tests on models ranging from 1.5B to 14B parameters, covering both standard multi-head attention and grouped-query attention architectures, TokenSwift consistently achieved over $3\times$ speedups. This speed boost translates into hours of runtime saved on demanding tasks, making it practical to generate unprecedentedly long sequences in research and real-world applications.
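To make the "accelerate without changing the model's predictions" idea concrete, the following is a minimal, self-contained Python sketch of a lossless draft-and-verify loop under greedy decoding. It is not the TokenSwift implementation or its API: the toy models, the draft length `k`, and all function names here are illustrative assumptions, and a real system would replace the verification loop with a single batched forward pass that reuses resident weights and a persistent KV cache.

```python
# Illustrative sketch only: drafted tokens are accepted only where they match
# the target model's greedy choice, so the final output is identical to plain
# target-model decoding while fewer slow "target" steps are paid per token.
import hashlib

VOCAB = 256  # toy vocabulary size


def _toy_logits(context, seed):
    """Deterministic stand-in for a model's next-token scores (not a real LLM)."""
    h = hashlib.sha256((seed + ",".join(map(str, context))).encode()).digest()
    return [h[i % len(h)] for i in range(VOCAB)]


def toy_target_model(context):
    """Greedy next token of the slow, high-quality target model (toy)."""
    logits = _toy_logits(context, "target")
    return max(range(VOCAB), key=lambda t: logits[t])


def toy_draft_model(context):
    """Greedy next token of the fast, approximate drafter (toy)."""
    logits = _toy_logits(context, "draft")
    return max(range(VOCAB), key=lambda t: logits[t])


def generate(prompt, max_new_tokens, k=4):
    """Draft k tokens cheaply, verify them against the target model, and keep
    the longest matching prefix plus the target's correction token."""
    tokens = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1) Draft k candidate tokens with the cheap model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = toy_draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify against the target model; stop at the first mismatch and
        #    substitute the target's own token there (guarantees progress).
        accepted, ctx = [], list(tokens)
        for t in draft:
            target_t = toy_target_model(ctx)
            if t == target_t:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(target_t)
                break
        accepted = accepted[: max_new_tokens - produced]
        tokens.extend(accepted)
        produced += len(accepted)
    return tokens


if __name__ == "__main__":
    print(generate(prompt=[1, 2, 3], max_new_tokens=16, k=4))
```

Because every accepted token equals the target model's greedy choice at that position, the sequence is bit-identical to ordinary decoding; the speedup comes from amortizing the expensive verification over several drafted tokens at once.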
Link To Code: https://github.com/bigai-nlco/TokenSwift
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Lossless Acceleration, Ultra-Long Sequence Generation
Submission Number: 1543