Efficient Sparse Decoding for Test-Time Scaling with KV Cache Disaggregation and Asynchronism

Published: 16 Oct 2025, Last Modified: 19 Nov 2025 · NeurIPS 2025 ER Workshop · CC BY 4.0
Keywords: Sparse Decoding, Test-Time Scaling, Asynchronism
Abstract: While block-level sparse attention has delivered strong efficiency gains on large language model test-time scaling tasks through simple yet effective designs, high concurrency and long decoding sequences can severely erode these gains. In this paper, we introduce a paradigm for optimizing conventional sparse decoding on test-time scaling tasks. We propose a disaggregated inference architecture that runs inference computation in parallel with KV cache management and selection, reducing decoding latency and increasing model serving throughput. Building on this decoupled design, we introduce an asynchronous KV cache filtering algorithm that delivers token-level, fine-grained contextual sparsity without sacrificing model quality under the same sparsity budget. We evaluate our method on a classical test-time reasoning benchmark against strong block-level sparse decoding baselines, where it achieves comparable or superior performance.
Submission Number: 74
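
The decoupling described in the abstract can be pictured as a decode loop that attends over a previously published sparse token subset while a background worker rescores the KV cache for later steps. The following is a minimal, hypothetical sketch of that general idea, assuming dot-product importance scoring and a fixed token budget; the names (`AsyncKVFilter`, `decode_step`, `budget`) are illustrative assumptions, not the paper's actual API, and the NumPy scoring stands in for whatever selection kernel the real system uses.

```python
# Illustrative sketch only: not the paper's implementation.
import threading
import queue
import numpy as np

class AsyncKVFilter:
    """Background worker that rescores cached keys and publishes a
    token-level sparse index set, decoupled from the decode thread."""
    def __init__(self, keys, budget):
        self.keys = keys                  # (seq_len, d) cached key vectors
        self.budget = budget              # number of tokens to keep
        self.selected = np.arange(min(budget, len(keys)))
        self._queries = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, q):
        # Decode thread hands off the latest query and returns immediately.
        self._queries.put(q)

    def _worker(self):
        while True:
            q = self._queries.get()
            # Token-level importance: one dot-product score per cached key
            # (a placeholder for the paper's filtering criterion).
            scores = self.keys @ q
            self.selected = np.argsort(scores)[-self.budget:]

def decode_step(q, keys, values, kv_filter):
    # Attend only over the currently published sparse token set; the
    # filter refreshes the selection asynchronously for later steps.
    idx = kv_filter.selected
    logits = keys[idx] @ q / np.sqrt(keys.shape[1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    out = weights @ values[idx]
    kv_filter.submit(q)   # overlap selection with subsequent computation
    return out

rng = np.random.default_rng(0)
keys = rng.normal(size=(4096, 64)).astype(np.float32)
values = rng.normal(size=(4096, 64)).astype(np.float32)
f = AsyncKVFilter(keys, budget=256)
out = decode_step(rng.normal(size=64).astype(np.float32), keys, values, f)
```

Note the deliberate staleness: each step attends over the subset chosen from an earlier query, which is what lets selection run off the critical path instead of serializing with attention computation.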