N-Gram Trie Speculative Decoding for Faster LLM In-Context Inference

ACL ARR 2025 February Submission7973 Authors

16 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: As an important prompt-engineering method, In-Context Learning (ICL) provides generalization and knowledge-enhancement capabilities for Large Language Models (LLMs). However, the extensive length of retrieved contexts and the limited token throughput of autoregressive models constrain the model's inference speed. To address this constraint, we propose N-Gram-Trie, a novel approach that exploits the potential overlap between the context and the model's output. The strategy uses the context to construct an n-gram trie, from which drafts are built that allow the LLM to generate tokens faster. We evaluate our method on summarization, Retrieval-Augmented Generation (RAG), and context Question Answering (context QA) tasks. Experimental results on Vicuna-7B, Llama2-7B-Chat, and Llama3-8B-Instruct all demonstrate a significant speedup without compromising accuracy. In comparison experiments, our method achieves the best mean speedup among various baselines.
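The abstract only outlines the idea, so the following is a minimal sketch of what n-gram-trie drafting for speculative decoding could look like, not the authors' implementation. The class names, the n-gram order n=4, the prefix length, and the maximum draft length are all illustrative assumptions.

```python
# A minimal sketch (NOT the paper's released code) of n-gram-trie drafting
# for speculative decoding. All names and parameters here are assumptions.
from collections import defaultdict

class NGramTrie:
    """Trie over n-grams of the context; children map token -> child node."""
    def __init__(self):
        self.children = defaultdict(NGramTrie)
        self.count = 0  # how often this prefix occurred in the context

    def insert(self, ngram):
        node = self
        for tok in ngram:
            node = node.children[tok]
            node.count += 1

def build_trie(context_tokens, n=4):
    """Index every length-n token window of the context in a trie."""
    trie = NGramTrie()
    for i in range(len(context_tokens) - n + 1):
        trie.insert(context_tokens[i:i + n])
    return trie

def propose_draft(trie, generated_tokens, prefix_len=3, max_draft=8):
    """Greedily extend the last `prefix_len` generated tokens by walking the
    trie and always following the most frequent child. The returned draft is
    then verified by the target LLM in parallel (standard speculative decoding).
    """
    draft = []
    prefix = list(generated_tokens[-prefix_len:])
    for _ in range(max_draft):
        node = trie
        found = True
        for tok in (prefix + draft)[-prefix_len:]:
            if tok not in node.children:  # prefix unseen in the context
                found = False
                break
            node = node.children[tok]
        if not found or not node.children:
            break
        # Pick the continuation most frequently observed in the context.
        next_tok = max(node.children.items(), key=lambda kv: kv[1].count)[0]
        draft.append(next_tok)
    return draft

# Example: draft = propose_draft(build_trie(context_ids), output_ids)
```

As in standard speculative decoding, the drafted tokens would be verified by the target LLM in a single forward pass, with the longest prefix matching the model's own predictions accepted; this preserves output quality while committing multiple tokens per model call, which is why context-output overlap translates into speedup.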
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: efficient models, data-to-text generation, text-to-text generation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7973
