Faster In-Context Learning for LLMs via N-Gram Trie Speculative Decoding

ACL ARR 2025 May Submission6311 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: As a crucial technique in prompt engineering, In-Context Learning (ICL) enhances the generalization and knowledge-utilization capabilities of Large Language Models (LLMs) (Dong et al., 2024). However, lengthy retrieved contexts and the limited token throughput of autoregressive decoding significantly constrain inference speed. To address this challenge, we propose N-Gram Trie Speculative Decoding, a novel approach that exploits the overlap between the context and the model's output. The method constructs an n-gram trie from the context and uses it to generate drafts, accelerating token generation for LLMs. We evaluate our approach on summarization, Retrieval-Augmented Generation (RAG), and context-based Question Answering (QA) tasks. Experimental results on Vicuna-7B, Llama2-7B-Chat, and Llama3-8B-Instruct demonstrate substantial speed improvements without compromising accuracy. Compared with several strong baselines, our method achieves the highest mean speedup, showcasing its effectiveness and efficiency.
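The drafting idea the abstract describes lends itself to a short illustration. Below is a minimal Python sketch, not the authors' implementation: the class name NGramTrie, its methods, the back-off lookup, and the toy token ids are hypothetical choices used to illustrate the general idea of indexing the context's n-grams in a trie and copying likely continuations as a speculative draft.

```python
class NGramTrie:
    """Trie over the context's n-grams; each node maps a token id to a child node."""

    def __init__(self, max_n: int = 4):
        self.max_n = max_n
        self.root: dict[int, dict] = {}

    def insert(self, context_tokens: list[int]) -> None:
        # Index every n-gram (up to max_n tokens) starting at each position.
        for i in range(len(context_tokens)):
            node = self.root
            for tok in context_tokens[i : i + self.max_n]:
                node = node.setdefault(tok, {})

    def _next_token(self, ctx: list[int]) -> int | None:
        # Back off from the longest usable suffix of ctx to shorter ones.
        for n in range(min(self.max_n - 1, len(ctx)), 0, -1):
            node = self.root
            matched = True
            for tok in ctx[-n:]:
                if tok not in node:
                    matched = False
                    break
                node = node[tok]
            if matched and node:
                # Take any observed continuation; a real drafter
                # would rank children, e.g. by frequency.
                return next(iter(node))
        return None

    def propose_draft(self, generated: list[int], draft_len: int = 8) -> list[int]:
        # Greedily extend the current output with tokens copied from the context.
        draft: list[int] = []
        ctx = list(generated)
        for _ in range(draft_len):
            nxt = self._next_token(ctx)
            if nxt is None:
                break
            draft.append(nxt)
            ctx.append(nxt)
        return draft


# Example: drafts copy spans that recur between context and output.
trie = NGramTrie(max_n=4)
trie.insert([5, 9, 2, 7, 1, 5, 9, 2])          # toy ids standing in for a retrieved context
print(trie.propose_draft([5, 9], draft_len=6))  # -> [2, 7, 1, 5, 9, 2]
```

In speculative decoding, the target model verifies such a draft in a single forward pass and keeps the longest prefix matching its own predictions, so rejected drafts cost little while accepted ones yield several tokens per pass.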
Paper Type: Long
Research Area: Generation
Research Area Keywords: efficient models, data-to-text generation, text-to-text generation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 6311