Keywords: Sparsity, Test-Time Scaling, KV Cache
Abstract: While test-time scaling (TTS) significantly unleashes the reasoning capability of large language models (LLMs) through long chain-of-thought (CoT), the linear growth of the KV cache amplifies the memory-bound bottleneck of LLM decoding. Currently, query-aware page-level sparse decoding achieves state-of-the-art performance under constrained computational budgets. However, we argue that this conventional approach is limited by both sequentially dependent page selection and coarse-grained token filtering, hampering both serving efficiency and model performance on TTS tasks under high concurrency. We propose $\texttt{\textbf{AsyncSpade}}$, a novel asynchronous framework for training-free efficient test-time scaling that tackles these issues. It is built on two core components: $\textbf{(1) a novel temporal-regressive module}$ that proactively predicts the query embedding of the next token; $\textbf{(2) an asynchronous and disaggregated architecture}$ that decouples KV cache filtering from the auto-regressive decoding loop, overlapping the token-level KV selection and its associated memory-reorganization overhead with the inference computation through asynchronism. Compared to previous work, $\texttt{\textbf{AsyncSpade}}$ is the first to eliminate this sequential dependence without sacrificing model performance.
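To make the two components concrete, the sketch below is a minimal, hypothetical illustration (not the authors' implementation): `predict_next_query` stands in for the temporal-regressive module with a toy linear extrapolation, `select_tokens` performs token-level top-k KV filtering, and the decode loop hides the filtering cost behind the forward pass via a background thread. All names, shapes (`HEAD_DIM`, `TOPK`), and the `model_step` callable are assumptions for illustration only.

```python
# Minimal sketch of the two ideas described in the abstract, under assumed
# shapes and a toy query predictor (NOT the paper's actual code):
#   (1) predict the next token's query embedding from recent queries;
#   (2) run KV-cache filtering asynchronously so it overlaps with decoding.
import threading
import torch

HEAD_DIM, TOPK = 128, 256  # hypothetical sizes


def predict_next_query(recent_queries: torch.Tensor) -> torch.Tensor:
    """Toy temporal-regressive predictor: linear extrapolation from the last
    two observed query embeddings (a stand-in for a learned module)."""
    q_prev, q_last = recent_queries[-2], recent_queries[-1]
    return q_last + (q_last - q_prev)


def select_tokens(pred_query: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    """Token-level KV filtering: keep indices of the top-k cached keys with the
    largest dot product against the predicted query."""
    scores = keys @ pred_query                       # [num_cached_tokens]
    k = min(TOPK, keys.shape[0])
    return torch.topk(scores, k=k).indices


def decode_loop(model_step, keys, values, recent_queries, num_steps=8):
    """Each iteration launches KV selection for step t+1 in a background thread
    (using the predicted query), then runs the current decode step, so the
    filtering cost is hidden behind the forward pass."""
    selected = torch.arange(keys.shape[0])           # start with the full cache
    for _ in range(num_steps):
        result = {}
        worker = threading.Thread(
            target=lambda: result.update(
                idx=select_tokens(predict_next_query(recent_queries), keys)),
        )
        worker.start()                               # asynchronous KV filtering
        q_t = model_step(keys[selected], values[selected])  # decode step t
        worker.join()                                # selection ready for t+1
        selected = result["idx"]
        recent_queries = torch.cat([recent_queries[1:], q_t[None]], dim=0)


if __name__ == "__main__":
    keys = torch.randn(4096, HEAD_DIM)
    values = torch.randn(4096, HEAD_DIM)
    recent = torch.randn(4, HEAD_DIM)
    # Stand-in for one attention + FFN decode step returning the new query.
    fake_step = lambda k, v: torch.randn(HEAD_DIM)
    decode_loop(fake_step, keys, values, recent)
```

In this toy setup the selection for step t+1 is computed from a predicted query rather than the actual one, which is what removes the sequential dependence between filtering and decoding.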
We validate the effectiveness of $\texttt{\textbf{AsyncSpade}}$ in common LLM serving scenarios on an A100 node, where $\texttt{\textbf{AsyncSpade}}$ can fully overlap the KV-cache-related operations with the inference pipeline, $\textbf{achieving the theoretically optimal time-per-output-token (TPOT)}$. Specifically, $\texttt{\textbf{AsyncSpade}}$ delivers over a 20\% reduction in TPOT compared to the strong Quest baseline, and at least a 50\% TPOT reduction compared to the full-attention method with Qwen3-32B and Qwen3-8B models, along with performance that surpasses Quest and rivals full attention across popular TTS benchmarks.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1043