QuickSilver - Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization

ACL ARR 2026 January Submission4510 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM efficiency, inference optimization, runtime adaptivity, token-level computation, autoregressive decoding, energy-efficient inference, quantization, large language models
Abstract: Inference has become the main source of resource consumption in large language model (LLM) deployments, frequently accounting for over 90 percent of overall latency, power consumption, and operating cost, and often exceeding the one-time cost of training. While substantial progress has been made in improving training efficiency, runtime optimization remains a long-standing bottleneck, particularly under autoregressive decoding. Existing approaches such as pruning, quantization, early exit, and speculative decoding typically require retraining, architectural modifications, or compromises in decoding behavior. We introduce QuickSilver, a token-level, modular framework that enables semantic adaptivity at inference time without changing model structure or weights. QuickSilver combines four complementary mechanisms: (i) Dynamic Token Halting, which detects when token representations converge and halts further computation; (ii) KV Cache Skipping, which avoids unnecessary memory updates for halted tokens to reduce attention cost; (iii) Contextual Token Fusion, which merges semantically similar tokens to reduce redundancy; and (iv) Adaptive Matryoshka Quantization, which dynamically adjusts per-token bit-widths to improve quantization efficiency. Unlike speculative decoding or mixture-of-experts routing, QuickSilver operates entirely at runtime on frozen, dense models without auxiliary networks or retraining. Evaluated on GPT-2 and LLaMA-2 using WikiText-103 and C4, QuickSilver achieves up to 39.6 percent fewer FLOPs with near-zero perplexity degradation (at most 0.2). These results demonstrate a practical, plug-and-play approach for scalable and energy-efficient LLM inference. The code is publicly available to encourage further research and adoption.
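The abstract describes Dynamic Token Halting as detecting when a token's representation has converged across layers and stopping its computation, with KV Cache Skipping then omitting memory updates for halted tokens. The page gives no code, so the following is only a minimal pure-Python sketch of that idea under stated assumptions: the function names, the per-layer cosine-similarity convergence test, and the threshold `tau` are illustrative choices, not the paper's actual algorithm.

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two vectors (lists of floats).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def run_with_halting(hidden, layers, tau=0.99):
    """Illustrative token halting (assumed criterion, not the paper's).

    hidden: list of per-token hidden-state vectors.
    layers: list of callables, each mapping a vector to the next layer's vector.
    A token halts once its representation stops changing (cosine similarity
    between consecutive layers >= tau); halted tokens skip all remaining
    layers -- and, in a real attention stack, would also skip KV-cache writes.
    Returns the final states and the layer index at which each token halted.
    """
    halted = [False] * len(hidden)
    halt_layer = [len(layers)] * len(hidden)  # default: ran all layers
    for li, layer in enumerate(layers):
        if all(halted):
            break  # nothing left to compute
        for t, vec in enumerate(hidden):
            if halted[t]:
                continue  # frozen token: no compute, no KV update
            new_vec = layer(vec)
            if cosine_sim(vec, new_vec) >= tau:
                halted[t] = True
                halt_layer[t] = li + 1
            hidden[t] = new_vec
    return hidden, halt_layer
```

For example, with identity-like layers every token halts after the first layer, while a layer that keeps changing the representation (e.g. negating it) never triggers the halt, so the token runs the full depth.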
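Adaptive Matryoshka Quantization is described only as dynamically adjusting per-token bit-widths. One plausible reading is a nested ("Matryoshka") bit-width ladder where each token gets the cheapest precision that keeps its reconstruction error below a tolerance. The sketch below illustrates that reading with simple symmetric uniform quantization; the candidate bit-widths, the relative-L2 error criterion, and the tolerance `tol` are all assumptions for illustration, not the paper's method.

```python
import math

def quantize_dequantize(vec, bits):
    # Symmetric uniform quantization of a vector to `bits` bits, then
    # dequantization (uses the symmetric range [-qmax, qmax] for simplicity).
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in vec)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(x / scale) * scale for x in vec]

def pick_bitwidth(vec, budget=(2, 4, 8), tol=0.05):
    """Illustrative per-token bit-width selection (assumed criterion).

    Walks the nested bit-width ladder `budget` from cheapest to most precise
    and returns the first bit-width whose relative L2 reconstruction error
    is at most `tol`, falling back to the largest bit-width otherwise.
    """
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    for bits in budget:
        deq = quantize_dequantize(vec, bits)
        err = math.sqrt(sum((x - y) ** 2 for x, y in zip(vec, deq))) / norm
        if err <= tol:
            return bits, deq
    return budget[-1], deq  # no candidate met tol: use highest precision
```

Under this scheme, a token whose values already sit on a coarse grid (e.g. all ±1) is stored at 2 bits exactly, while a token with finer-grained values is promoted to a wider bit-width, which is the kind of per-token adaptivity the abstract attributes to this component.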
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: inference efficiency, runtime optimization, computational cost reduction, memory efficiency, quantization methods, autoregressive decoding, large language models
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4510