QuickSilver - Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization

ACL ARR 2026 January Submission4510 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM efficiency, inference optimization, runtime adaptivity, token-level computation, autoregressive decoding, energy-efficient inference, quantization, large language models
Abstract: Inference has become the main source of resource consumption in large language model (LLM) deployments, frequently accounting for over 90 percent of overall latency, power consumption, and operating cost, and often exceeding the one-time cost of training. While substantial progress has been made in improving training efficiency, runtime optimization remains a long-standing bottleneck, particularly under autoregressive decoding. Existing approaches such as pruning, quantization, early exit, and speculative decoding typically require retraining, architectural modifications, or compromises in decoding behavior. We introduce QuickSilver, a token-level, modular framework that enables semantic adaptivity at inference time without changing model structure or weights. QuickSilver combines four complementary mechanisms: (i) Dynamic Token Halting, which detects when token representations converge and halts further computation; (ii) KV Cache Skipping, which avoids unnecessary memory updates for halted tokens to reduce attention cost; (iii) Contextual Token Fusion, which merges semantically similar tokens to reduce redundancy; and (iv) Adaptive Matryoshka Quantization, which dynamically adjusts per-token bit-widths to improve quantization efficiency. Unlike speculative decoding or mixture-of-experts routing, QuickSilver operates entirely at runtime on frozen, dense models without auxiliary networks or retraining. Evaluated on GPT-2 and LLaMA-2 using WikiText-103 and C4, QuickSilver achieves up to 39.6 percent fewer FLOPs with near-zero perplexity degradation (at most 0.2). These results demonstrate a practical, plug-and-play approach for scalable and energy-efficient LLM inference. The code is publicly available to encourage further research and adoption.
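The abstract describes Dynamic Token Halting as detecting when a token's representation has converged across layers and stopping its computation, with KV Cache Skipping then omitting memory updates for halted tokens. The page gives no code, so the following is only a minimal pure-Python sketch of that idea under stated assumptions: the function names, the per-layer cosine-similarity convergence test, and the threshold `tau` are illustrative choices, not the paper's actual algorithm.

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two vectors (lists of floats).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def run_with_halting(hidden, layers, tau=0.99):
    """Illustrative token halting (assumed criterion, not the paper's).

    hidden: list of per-token hidden-state vectors.
    layers: list of callables, each mapping a vector to the next layer's vector.
    A token halts once its representation stops changing (cosine similarity
    between consecutive layers >= tau); halted tokens skip all remaining
    layers -- and, in a real attention stack, would also skip KV-cache writes.
    Returns the final states and the layer index at which each token halted.
    """
    halted = [False] * len(hidden)
    halt_layer = [len(layers)] * len(hidden)  # default: ran all layers
    for li, layer in enumerate(layers):
        if all(halted):
            break  # nothing left to compute
        for t, vec in enumerate(hidden):
            if halted[t]:
                continue  # frozen token: no compute, no KV update
            new_vec = layer(vec)
            if cosine_sim(vec, new_vec) >= tau:
                halted[t] = True
                halt_layer[t] = li + 1
            hidden[t] = new_vec
    return hidden, halt_layer
```

For example, with identity-like layers every token halts after the first layer, while a layer that keeps changing the representation (e.g. negating it) never triggers the halt, so the token runs the full depth.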
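Adaptive Matryoshka Quantization is described only as dynamically adjusting per-token bit-widths. One plausible reading is a nested ("Matryoshka") bit-width ladder where each token gets the cheapest precision that keeps its reconstruction error below a tolerance. The sketch below illustrates that reading with simple symmetric uniform quantization; the candidate bit-widths, the relative-L2 error criterion, and the tolerance `tol` are all assumptions for illustration, not the paper's method.

```python
import math

def quantize_dequantize(vec, bits):
    # Symmetric uniform quantization of a vector to `bits` bits, then
    # dequantization (uses the symmetric range [-qmax, qmax] for simplicity).
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in vec)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(x / scale) * scale for x in vec]

def pick_bitwidth(vec, budget=(2, 4, 8), tol=0.05):
    """Illustrative per-token bit-width selection (assumed criterion).

    Walks the nested bit-width ladder `budget` from cheapest to most precise
    and returns the first bit-width whose relative L2 reconstruction error
    is at most `tol`, falling back to the largest bit-width otherwise.
    """
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    for bits in budget:
        deq = quantize_dequantize(vec, bits)
        err = math.sqrt(sum((x - y) ** 2 for x, y in zip(vec, deq))) / norm
        if err <= tol:
            return bits, deq
    return budget[-1], deq  # no candidate met tol: use highest precision
```

Under this scheme, a token whose values already sit on a coarse grid (e.g. all ±1) is stored at 2 bits exactly, while a token with finer-grained values is promoted to a wider bit-width, which is the kind of per-token adaptivity the abstract attributes to this component.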
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: inference efficiency, runtime optimization, computational cost reduction, memory efficiency, quantization methods, autoregressive decoding, large language models
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4510