Parallel Prompting: Fast LLM Inference for Shared-Context, Short-to-Moderate Output

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: attention, GPUs, inference, parallelization, large language models
Abstract: We introduce \emph{Parallel Prompting}, a method for high-throughput, quality-preserving decoding of multiple large language model (LLM) queries that share a common prefix. Such shared-context structure arises naturally in applications including document question answering, few-shot learning, multi-user chat, and evaluation pipelines. Prior approaches either degrade generation quality by merging queries into a single prompt that the model cannot reliably disentangle, or impose rigid batching and preallocated memory that limit practical deployment. Parallel Prompting is a free lunch for batch prompting: it improves throughput and memory efficiency without requiring model retraining or sacrificing accuracy. The gains are most pronounced when prefix overlap is high and output lengths are short to moderate; the relative advantage diminishes as unique suffixes grow longer. Our method executes a single pass over the shared context and decodes all continuations in parallel through efficient matrix–matrix operations, while avoiding cross-query interference and supporting flexible batching across multiple sharing groups with dynamic, on-demand KV-cache management. This design enables high resource utilization during decoding without compromising output quality. Experiments on popular datasets with Llama 3-8B show up to a 4× reduction in end-to-end latency relative to competitive baselines, with no loss in accuracy, demonstrating that Parallel Prompting complements existing batching strategies and expands the practical throughput of LLM-based systems.
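The abstract's core mechanism, computing the shared-prefix KV cache once and letting every continuation attend to it while per-query suffix caches grow on demand, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the single-head toy layer, the names (d_model, prefix_len, n_queries, decode_step), and the per-query Python loop over suffix caches are illustrative assumptions; a real system would fuse these steps into batched kernels across transformer layers.

# Minimal sketch (assumed names, toy single-head attention, not the paper's code):
# the shared-prefix keys/values are computed once and reused by all queries,
# while each query keeps its own suffix cache that grows during decoding.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, prefix_len, n_queries = 64, 128, 8

# Toy projections standing in for one attention layer.
Wq = torch.randn(d_model, d_model) / d_model**0.5
Wk = torch.randn(d_model, d_model) / d_model**0.5
Wv = torch.randn(d_model, d_model) / d_model**0.5

# 1) Prefill: run the shared context once; store its keys/values a single time.
prefix_h = torch.randn(prefix_len, d_model)        # hidden states of the shared prefix
prefix_k, prefix_v = prefix_h @ Wk, prefix_h @ Wv  # shared KV cache, no per-query copy

# Per-query suffix caches, allocated on demand as decoding proceeds.
suffix_k = [torch.empty(0, d_model) for _ in range(n_queries)]
suffix_v = [torch.empty(0, d_model) for _ in range(n_queries)]

def decode_step(hidden):
    """One decoding step for all queries at once.

    hidden: (n_queries, d_model) current-token hidden states, one row per query.
    Attention of all queries against the shared prefix is a single
    matrix-matrix product; per-query suffix attention stays isolated.
    """
    q = hidden @ Wq                                # (n_queries, d_model)
    scores_prefix = q @ prefix_k.T / d_model**0.5  # (n_queries, prefix_len), one GEMM
    outs = []
    for i in range(n_queries):                     # no cross-query interference here
        k_i = torch.cat([suffix_k[i], hidden[i:i+1] @ Wk])
        v_i = torch.cat([suffix_v[i], hidden[i:i+1] @ Wv])
        scores_i = q[i:i+1] @ k_i.T / d_model**0.5
        attn = F.softmax(torch.cat([scores_prefix[i:i+1], scores_i], dim=-1), dim=-1)
        outs.append(attn @ torch.cat([prefix_v, v_i]))
        suffix_k[i], suffix_v[i] = k_i, v_i        # grow this query's cache on demand
    return torch.cat(outs)

out = decode_step(torch.randn(n_queries, d_model))
print(out.shape)  # torch.Size([8, 64])

The point of the sketch is the shape of the computation: the query-versus-shared-prefix attention for all continuations collapses into one matrix–matrix product, while each query's unique suffix lives in a separate, dynamically grown cache, which is what keeps continuations from contaminating one another.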
Supplementary Material: zip
Primary Area: generative models
Submission Number: 22390