Fast LLM Inference with Parallel Prompting

ACL ARR 2024 December Submission 2357 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: This paper presents a new method for efficiently decoding multiple queries over the same content in Transformer language models. This setting is common in tasks with many prompts that share a prefix, such as document question answering with a large number of questions per document. Traditional methods either prompt the language model with each query independently in a batch or combine multiple questions into one larger prompt. Both approaches decode autoregressively, one token per forward pass, relying on inefficient matrix-vector products for every sequence in the batch. They also suffer from duplicated key-value (KV) caches, quality degradation, or redundant memory traffic when large KV caches are read from memory, which wastes GPU memory and reduces performance. Our proposed method addresses these challenges by decoding queries in parallel, replacing matrix-vector products with more efficient matrix-matrix products and improving efficiency without compromising result quality. Experimental results demonstrate that our method effectively increases throughput on multiple downstream tasks, providing a reliable solution for prompt inference in language models.
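The abstract describes replacing per-sequence matrix-vector products with a single matrix-matrix product over a shared-prefix KV cache. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration of that core attention step, with all dimensions (head size, number of queries, prefix length) chosen arbitrarily for the example.

```python
# Minimal sketch (illustrative, not the paper's code) of the core idea:
# n queries that share one document prefix attend to a single cached prefix
# KV, so the decoding step becomes one matrix-matrix product instead of
# n separate matrix-vector products against duplicated caches.
import torch

d, n_queries, prefix_len = 64, 8, 1024   # head dim, #queries, shared-prefix length (assumed values)

# KV cache for the shared document prefix: computed once, stored once.
prefix_k = torch.randn(prefix_len, d)
prefix_v = torch.randn(prefix_len, d)

# Current decoding step: one query vector per independent question.
q = torch.randn(n_queries, d)

# Parallel decoding: a single (n_queries x d) @ (d x prefix_len) matrix-matrix product.
scores = q @ prefix_k.T / d ** 0.5        # (n_queries, prefix_len)
attn = torch.softmax(scores, dim=-1)
out = attn @ prefix_v                     # (n_queries, d)

# A batched baseline would instead run n_queries matrix-vector products,
# each against its own duplicated copy of prefix_k / prefix_v.
```

In this simplified view, the memory saving comes from storing the prefix cache once rather than once per sequence, and the throughput gain comes from batching the queries' attention into one GEMM; per-query tokens generated after the prefix would still need their own (much smaller) caches.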
Paper Type: Long
Research Area: Generation
Research Area Keywords: Generation, Language Modeling, Efficient/Low-Resource Methods for NLP, Question Answering
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 2357