Keywords: Attention, GPUs, Inference, Parallelization, Large Language Models
Abstract: We introduce Parallel Prompting, a novel method for efficiently decoding multiple queries that share a common prefix in large language models (LLMs). This scenario occurs naturally in tasks such as document question answering, few-shot learning, and chatbot systems, where many prompts have substantial overlap. Our approach overcomes the shortcomings of prior methods, which either degrade output quality or manage the cache inefficiently. Crucially, we identify that maximizing inference throughput requires a careful balance between attention parallelism and batch size. The theoretical maximum throughput lies at a point determined by the hardware and model specifics, and cannot be reached by increasing batch size or attention parallelism alone. In contrast to related methods that forbid hybrid batching or require pre-allocated memory for the entire generation, our approach supports flexible batching across multiple sharing groups and enables dynamic, on-demand memory usage. By decoding all queries in parallel with efficient matrix-matrix operations, our method significantly improves throughput and memory utilization without compromising result quality. Experimental results demonstrate that our method reduces end-to-end Llama3-8B latency by up to 4× compared with competitive baselines on popular datasets, without compromising output quality or accuracy.
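The core idea stated in the abstract, storing the shared prefix's KV cache once and folding each decoding step's attention over that prefix into a single matrix-matrix product across all queries in the group, can be illustrated with a minimal single-head NumPy sketch. This is not the authors' implementation; the function name, tensor shapes, and single-head layout are illustrative assumptions.

```python
# Minimal sketch (illustrative only) of shared-prefix attention: the prefix
# keys/values are stored once and reused by every query in the sharing group,
# so per-step attention over the prefix is one matrix-matrix product for all
# queries rather than one matrix-vector product per query.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_prefix_attention(q, prefix_k, prefix_v, suffix_k, suffix_v):
    """
    q:        (n_queries, d)              current-step query per sequence
    prefix_k: (prefix_len, d)             shared prefix keys (stored once)
    prefix_v: (prefix_len, d)             shared prefix values (stored once)
    suffix_k: (n_queries, suffix_len, d)  per-sequence suffix keys
    suffix_v: (n_queries, suffix_len, d)  per-sequence suffix values
    returns:  (n_queries, d)              attention output per sequence
    """
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)

    # One matrix-matrix product over the shared prefix for ALL queries.
    prefix_scores = q @ prefix_k.T * scale                        # (n, prefix_len)
    # Per-sequence scores over each sequence's own generated suffix.
    suffix_scores = np.einsum('nd,nsd->ns', q, suffix_k) * scale  # (n, suffix_len)

    # Joint softmax over prefix + suffix positions.
    scores = np.concatenate([prefix_scores, suffix_scores], axis=1)
    probs = softmax(scores, axis=1)
    p_pre = probs[:, :prefix_k.shape[0]]
    p_suf = probs[:, prefix_k.shape[0]:]

    # Weighted sums: the shared prefix values are reused by every query.
    return p_pre @ prefix_v + np.einsum('ns,nsd->nd', p_suf, suffix_v)
```

Under these assumptions, the prefix KV tensors appear once in memory regardless of group size, and the dominant cost (the attention over the long shared prefix) runs as a dense matrix-matrix operation, which is the kind of GPU-friendly computation the abstract attributes to the method.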
Supplementary Material: zip
Primary Area: generative models
Submission Number: 22390