QuickSilver - Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization
Abstract: Inference has become the dominant driver of resource consumption in large language model (LLM) deployments, often accounting for over 90% of total latency, energy use, and operational cost—surpassing even the one-time expense of training. While training-time efficiency has advanced considerably, runtime optimization remains a critical bottleneck, particularly under autoregressive decoding. Existing methods—such as pruning, quantization, early exits, and speculative decoding—often require retraining, architectural modifications, or compromises in decoding compatibility. We present QuickSilver, a modular, token-level framework that enables semantic adaptivity at inference time without altering model weights or structure. QuickSilver integrates four synergistic mechanisms: (i) Dynamic Token Halting, which detects representational stability and halts further computation for semantically saturated tokens; (ii) KV Cache Skipping, which suppresses memory updates for halted tokens, reducing attention-layer work and runtime cost; (iii) Contextual Token Fusion, which identifies and merges similar tokens during inference, reducing redundancy in the token stream; and (iv) Adaptive Matryoshka Quantization, which adjusts token-level bit-widths dynamically, allocating precision only where it is needed. Unlike speculative decoding or mixture-of-experts routing, QuickSilver operates entirely at runtime on frozen, dense models, requiring no auxiliary networks or retraining. Evaluated on GPT-2 and Llama-2 across WikiText-103 and C4, QuickSilver achieves up to 39.6% FLOP reduction with minimal perplexity degradation ($\leq$0.2), offering a lightweight, plug-and-play path toward scalable, energy-efficient inference. To facilitate adoption and further research, we release our implementation publicly.
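As a companion to the abstract, the minimal sketch below illustrates one plausible reading of Dynamic Token Halting and its interaction with KV Cache Skipping on a frozen decoder stack: a token is frozen once its representation stops changing across layers. The cosine-similarity stability test, the `halt_threshold` parameter, and the toy layer stack are illustrative assumptions only; the paper's actual halting criterion and KV-skipping implementation are not specified in this abstract.

```python
import torch
import torch.nn.functional as F

def forward_with_halting(layers, hidden, halt_threshold=0.99):
    """hidden: (seq_len, d_model); layers: callables mapping (seq_len, d_model) -> (seq_len, d_model)."""
    # active[i] is True while token i still receives per-layer updates.
    active = torch.ones(hidden.size(0), dtype=torch.bool)
    for layer in layers:
        if not active.any():
            break  # every token has halted; remaining layers are skipped entirely
        updated = layer(hidden)
        # Representational stability: how little this layer changed each token.
        sim = F.cosine_similarity(hidden, updated, dim=-1)
        # Halted tokens keep their previous state. In a full implementation they
        # would also be excluded from the layer's computation and their KV cache
        # entries would no longer be written, which is where the savings come from.
        hidden = torch.where(active.unsqueeze(-1), updated, hidden)
        active &= sim < halt_threshold
    return hidden

# Toy usage: linear layers stand in for transformer blocks.
blocks = [torch.nn.Linear(16, 16) for _ in range(4)]
out = forward_with_halting(blocks, torch.randn(8, 16))
```

Under this reading, the per-token savings grow with the fraction of tokens that saturate early; the actual mechanism and criterion may differ from this sketch.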
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: LLM Inference, Dynamic Token Halting, KV Skipping, Adaptive Matryoshka Quantization
Contribution Types: Approaches low compute settings-efficiency
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Section 5: Broader Impact
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 3: Performance
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section 3: Performance
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B5 Elaboration: We did not create any new datasets; we use existing ones - Section 3: Performance
B6 Statistics For Data: N/A
B6 Elaboration: We did not create any new datasets; we use existing ones - Section 3: Performance
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix G: Implementation Details and Hyperparameters
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Appendix G: Implementation Details and Hyperparameters
C3 Descriptive Statistics: N/A
C3 Elaboration: Section 3: Performance
C4 Parameters For Packages: Yes
C4 Elaboration: Appendix G: Implementation Details and Hyperparameters
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D1 Elaboration: We did not create any new datasets; we use existing ones - Section 3: Performance
D2 Recruitment And Payment: N/A
D2 Elaboration: We did not create any new datasets; we use existing ones - Section 3: Performance
D3 Data Consent: N/A
D3 Elaboration: We did not create any new datasets; we use existing ones - Section 3: Performance
D4 Ethics Review Board Approval: N/A
D4 Elaboration: We did not create any new datasets; we use existing ones - Section 3: Performance
D5 Characteristics Of Annotators: N/A
D5 Elaboration: We did not create any new datasets; we use existing ones - Section 3: Performance
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We did not create any new datasets; we use existing ones - Section 3: Performance
Author Submission Checklist: Yes
Submission Number: 594