Keywords: Sparse attention, KV cache offloading, Long Context
TL;DR: NOSA is a sparse attention mechanism for KV-cache offloading with native sparse training support, boosting decoding throughput by up to 5.04×, 1.92×, and 1.83× over FullAttn, InfLLMv2, and ShadowKV, respectively.
Abstract: Decoding throughput is often limited by GPU memory, which is dominated by the KV cache.
Existing KV cache offloading reduces memory by storing context on CPU and fetching sparse KV subsets, but training-free methods suffer from long-generation quality degradation, while trainable sparse attention incurs excessive CPU--GPU transfers.
We propose NOSA, a trainable sparse attention mechanism natively designed for KV cache offloading.
NOSA constrains CPU--GPU KV transfer volume to lower communication overhead and improve throughput.
We further build NOSI, an offloading inference system that realizes NOSA's efficiency.
Experiments on {1,3,8}B LLMs show that NOSA improves quality across general, long-input, and long-generation tasks, while boosting decoding throughput by up to $5.04\times$, $1.92\times$, and $1.83\times$ over FullAttn, InfLLMv2, and ShadowKV.
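
To illustrate why bounding CPU--GPU transfer volume matters, here is a minimal sketch, not NOSA's actual mechanism (the abstract does not specify how the constraint is enforced). It assumes the KV cache is split into blocks kept on the CPU, that each decoding step selects `k` blocks by score, and that only blocks not already resident on the GPU must be fetched. The functions `select_blocks`, `select_blocks_bounded`, and `simulate`, and the `max_fetch` cap, are hypothetical names introduced for this example.

```python
import random


def select_blocks(scores, k):
    """Unconstrained selection: the k highest-scoring blocks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def select_blocks_bounded(scores, k, resident, max_fetch):
    """Pick up to k blocks by score, fetching at most max_fetch non-resident ones.

    `resident` is the set of block ids already on the GPU (assumed here to be
    the previous step's selection). Non-resident blocks beyond the cap are skipped.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen, fetched = [], 0
    for i in order:
        if len(chosen) == k:
            break
        if i in resident:
            chosen.append(i)          # already on GPU, no transfer needed
        elif fetched < max_fetch:
            chosen.append(i)          # costs one CPU-to-GPU block transfer
            fetched += 1
    return set(chosen)


def simulate(num_blocks=64, k=8, max_fetch=2, steps=50, seed=0):
    """Count fetched blocks per policy over `steps` decoding steps.

    Scores are random placeholders, not real attention scores.
    Returns (unconstrained_total, bounded_total, worst_bounded_step).
    """
    rng = random.Random(seed)
    scores = [rng.random() for _ in range(num_blocks)]
    resident_free = select_blocks(scores, k)
    resident_bounded = set(resident_free)
    free_total = bounded_total = worst_bounded = 0
    for _ in range(steps):
        scores = [rng.random() for _ in range(num_blocks)]
        new_free = select_blocks(scores, k)
        free_total += len(new_free - resident_free)
        resident_free = new_free
        new_bounded = select_blocks_bounded(scores, k, resident_bounded, max_fetch)
        fetched = len(new_bounded - resident_bounded)
        bounded_total += fetched
        worst_bounded = max(worst_bounded, fetched)
        resident_bounded = new_bounded
    return free_total, bounded_total, worst_bounded


if __name__ == "__main__":
    print(simulate())
```

In this toy setting the bounded policy never moves more than `max_fetch` blocks per step, whereas the unconstrained policy can move up to `k`. The trade-off is that the bounded selection may skip some high-scoring blocks, which is presumably why a method of this kind benefits from training the model with the constraint in place.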
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 35