Abstract: As an important prompt engineering technique, In-Context Learning (ICL) provides generalization and knowledge-enhancement capabilities for Large Language Models (LLMs). However, the considerable length of retrieved contexts and the limited token throughput of autoregressive models constrain a model's reasoning speed. To address this constraint, we propose N-Gram-Trie, a novel approach that exploits the potential overlap between the context and the model output. The method uses the context to construct an n-gram trie, from which drafts are built to accelerate the LLM's token generation. We evaluate our method on summarization, Retrieval-Augmented Generation (RAG), and context Question Answering (context QA) tasks. Experimental results on Vicuna-7B, Llama2-7B-Chat, and Llama3-8B-Instruct all demonstrate a significant speedup without compromising accuracy. In comparison experiments, our method achieves the best mean speedup among various baselines.
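To make the core idea concrete, here is a minimal Python sketch of the kind of context-indexed n-gram trie the abstract describes: the retrieved context is indexed once, and the last few generated tokens are matched against it to propose a draft continuation that the LLM can verify in parallel. This is an illustrative assumption, not the authors' implementation; the class name `NGramTrie`, the methods `build`/`draft`, and the greedy single-path lookup are all hypothetical choices.

```python
class NGramTrie:
    """Hypothetical sketch of an n-gram trie over context tokens."""

    def __init__(self, n: int = 3):
        self.n = n          # n-gram order; depth of the trie
        self.root = {}      # nested dicts: token -> child node

    def build(self, tokens: list[int]) -> None:
        """Index every n-gram of the context, so any (n-1)-token
        prefix can be followed to its observed continuations."""
        for i in range(len(tokens) - self.n + 1):
            node = self.root
            for tok in tokens[i : i + self.n]:
                node = node.setdefault(tok, {})

    def draft(self, prefix: list[int], max_len: int = 8) -> list[int]:
        """Greedily extend the last (n-1) generated tokens along the
        trie, yielding draft tokens for the LLM to verify in one pass."""
        out: list[int] = []
        context = list(prefix)
        for _ in range(max_len):
            node = self.root
            matched = True
            for tok in context[-(self.n - 1):]:
                if tok not in node:
                    matched = False
                    break
                node = node[tok]
            if not matched or not node:
                break           # prefix unseen in context: stop drafting
            nxt = next(iter(node))  # pick one observed continuation
            out.append(nxt)
            context.append(nxt)
        return out


# Hypothetical usage: context_ids / generated_ids are token-id lists.
# trie = NGramTrie(n=3)
# trie.build(context_ids)
# draft_ids = trie.draft(generated_ids)  # verified by the LLM in parallel
```

The speedup intuition is that in summarization, RAG, and context QA, the output frequently copies spans from the context, so trie lookups cheaply recover multi-token drafts that the model then accepts in a single forward pass.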
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: efficient models, data-to-text generation, text-to-text generation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7973