Abstract: Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches.
Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large transformer backbones, we explore transitioning transformer models into hybrid architectures for more efficient generation.
In this work, we propose \textsc{LightTransfer},
a lightweight method that transforms models such as LLaMA into hybrid variants.
Our approach identifies \textit{lazy} layers---those focusing on recent or initial tokens---and replaces their full attention with streaming attention.
The transformation requires no training for long-context understanding tasks and only minimal fine-tuning for o1-like long-reasoning generation tasks that demand stronger reasoning capabilities.
Experiments across diverse benchmarks and models (e.g., LLaMA, Mistral, QwQ-STILL) demonstrate that,
even when half of the layers are identified as \textit{lazy},
\textsc{LightTransfer} achieves up to a 2.17$\times$ throughput improvement with minimal performance loss ($<1.5\%$ on LongBench), and reaches 53.3\% on the math benchmark AIME24 with the advanced o1-like long-reasoning model QwQ-STILL.
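To make the abstract's core idea concrete, below is a minimal sketch, assuming a simple attention-mass criterion: a layer is flagged as \textit{lazy} when most of its attention falls on a few initial ("sink") tokens plus a recent window, and such layers then keep only that portion of their KV cache (streaming attention). All names and hyperparameters here (\texttt{lazy\_score}, \texttt{SINK}, \texttt{WINDOW}, \texttt{THRESHOLD}) are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: score each layer by how much attention mass lands on
# sink + recent tokens, flag high-scoring layers as "lazy", and truncate their KV
# cache to the streaming-attention window. Hyperparameters are assumed values.
import torch

SINK, WINDOW, THRESHOLD = 4, 256, 0.95  # assumed sink size, window size, cutoff

def lazy_score(attn: torch.Tensor) -> float:
    """attn: (heads, seq, seq) causal softmax attention weights for one layer.
    Returns the average fraction of attention mass on sink + recent tokens."""
    heads, seq, _ = attn.shape
    idx = torch.arange(seq)
    q, k = idx[:, None], idx[None, :]
    # mass on the first SINK tokens (causal weights are already zero for k > q)
    mass_sink = attn[:, :, :SINK].sum(dim=-1)
    # mass on the recent window (excluding sink tokens to avoid double counting)
    recent = (k >= SINK) & (k <= q) & (k > q - WINDOW)
    mass_recent = (attn * recent[None]).sum(dim=-1)
    return (mass_sink + mass_recent).mean().item()

def select_lazy_layers(per_layer_attn):
    """Return indices of layers whose attention is concentrated on sink + recent tokens."""
    return [i for i, a in enumerate(per_layer_attn) if lazy_score(a) >= THRESHOLD]

def streaming_kv(keys: torch.Tensor, values: torch.Tensor):
    """Truncate a (heads, seq, dim) KV cache to sink + recent tokens, i.e. the
    cache a streaming-attention layer would retain during generation."""
    if keys.shape[1] <= SINK + WINDOW:
        return keys, values
    k = torch.cat([keys[:, :SINK], keys[:, -WINDOW:]], dim=1)
    v = torch.cat([values[:, :SINK], values[:, -WINDOW:]], dim=1)
    return k, v
```

In this reading, the memory saving comes from the lazy layers storing only SINK + WINDOW key-value pairs regardless of sequence length, while the remaining layers keep full attention; the exact layer-selection criterion used by \textsc{LightTransfer} is described in the paper itself.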
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Li_Dong1
Submission Number: 5063