N-Gram Trie Speculative Decoding for Faster LLM In-Context Inference

ACL ARR 2025 February Submission7973 Authors

16 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: As an important prompt-engineering method, In-Context Learning (ICL) provides generalization and knowledge-enhancement capabilities for Large Language Models (LLMs). However, the extensive length of retrieved contexts and the limited token throughput of autoregressive models constrain the model's inference speed. To address this constraint, we propose N-Gram-Trie, a novel approach that exploits the potential overlap between the context and the model's output. The strategy uses the context to construct an n-gram trie, from which drafts are built that allow the LLM to generate tokens faster. We evaluate our method on summarization, Retrieval-Augmented Generation (RAG), and context Question Answering (context QA) tasks. Experimental results on Vicuna-7B, Llama2-7B-Chat, and Llama3-8B-Instruct all demonstrate a significant speedup without compromising accuracy. In comparison experiments, our method achieves the best mean speedup among various baselines.
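The abstract only outlines the idea, so the following is a minimal sketch of what n-gram-trie drafting for speculative decoding could look like, not the authors' implementation. The class names, the n-gram order n=4, the prefix length, and the maximum draft length are all illustrative assumptions.

```python
# A minimal sketch (NOT the paper's released code) of n-gram-trie drafting
# for speculative decoding. All names and parameters here are assumptions.
from collections import defaultdict

class NGramTrie:
    """Trie over n-grams of the context; children map token -> child node."""
    def __init__(self):
        self.children = defaultdict(NGramTrie)
        self.count = 0  # how often this prefix occurred in the context

    def insert(self, ngram):
        node = self
        for tok in ngram:
            node = node.children[tok]
            node.count += 1

def build_trie(context_tokens, n=4):
    """Index every length-n token window of the context in a trie."""
    trie = NGramTrie()
    for i in range(len(context_tokens) - n + 1):
        trie.insert(context_tokens[i:i + n])
    return trie

def propose_draft(trie, generated_tokens, prefix_len=3, max_draft=8):
    """Greedily extend the last `prefix_len` generated tokens by walking the
    trie and always following the most frequent child. The returned draft is
    then verified by the target LLM in parallel (standard speculative decoding).
    """
    draft = []
    prefix = list(generated_tokens[-prefix_len:])
    for _ in range(max_draft):
        node = trie
        found = True
        for tok in (prefix + draft)[-prefix_len:]:
            if tok not in node.children:  # prefix unseen in the context
                found = False
                break
            node = node.children[tok]
        if not found or not node.children:
            break
        # Pick the continuation most frequently observed in the context.
        next_tok = max(node.children.items(), key=lambda kv: kv[1].count)[0]
        draft.append(next_tok)
    return draft

# Example: draft = propose_draft(build_trie(context_ids), output_ids)
```

As in standard speculative decoding, the drafted tokens would be verified by the target LLM in a single forward pass, with the longest prefix matching the model's own predictions accepted; this preserves output quality while committing multiple tokens per model call, which is why context-output overlap translates into speedup.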
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: efficient models, data-to-text generation, text-to-text generation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7973
