Intra-Prompt Parallel Decoding for Common-Context Question Answering

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Parallel Decoding, Efficient Inference, Question Answering, Generation
TL;DR: Intra-Prompt Parallel Decoding (IPPD) lets LLMs answer multiple questions from the same context simultaneously within a single prompt, significantly boosting throughput.
Abstract: In common-context question answering (CCQA) tasks, multiple questions share a common context on which to base their answers. However, Large Language Models (LLMs) typically generate each answer using an independent prompt. While existing batching and caching techniques help improve parallelism and reduce repeated computation, the separation of questions across prompts limits the achievable speedup, as modern GPUs are underutilized due to a memory bottleneck during attention. We present Intra-Prompt Parallel Decoding (IPPD), a novel inference method that answers multiple common-context questions in parallel within a single prompt. IPPD directly addresses the bottleneck by efficiently sharing both memory and computation during attention: the next token for every question is decoded in a single inference step. IPPD uses virtual position IDs and attention-mask manipulation to produce the same output as standard prompting, without fine-tuning or any changes to the LLM architecture. Since all parallelism occurs within a prompt, IPPD is fully compatible with batched inference, even when each prompt features a different context. Our experiments show that IPPD delivers up to 7× the effective throughput of standard decoding without quality degradation, and outperforms state-of-the-art inference acceleration methods on real-world datasets.
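The virtual-position-ID and attention-mask idea from the abstract can be illustrated with a minimal sketch. This is an assumption-laden toy, not the authors' implementation: the function name `build_ippd_layout` and the concrete layout (each question's positions restarting right after the shared context, and a block mask that lets each question see the context and its own causal prefix but never a sibling question) are illustrative choices consistent with the abstract's description.

```python
import numpy as np

def build_ippd_layout(context_len, question_lens):
    """Illustrative sketch of an IPPD-style layout (not the paper's code).

    Returns virtual position IDs and a boolean attention mask
    (True = may attend) for a shared context followed by several
    questions packed into one prompt.
    """
    total = context_len + sum(question_lens)

    # Virtual position IDs: every question's positions restart
    # immediately after the shared context, so each question "believes"
    # it directly follows the context.
    pos = list(range(context_len))
    for qlen in question_lens:
        pos += list(range(context_len, context_len + qlen))

    mask = np.zeros((total, total), dtype=bool)

    # The context is ordinary causal attention over itself.
    for i in range(context_len):
        mask[i, : i + 1] = True

    # Each question attends to the full context and causally to its
    # own tokens, but never to tokens of sibling questions.
    offset = context_len
    for qlen in question_lens:
        for i in range(qlen):
            row = offset + i
            mask[row, :context_len] = True      # shared context
            mask[row, offset : row + 1] = True  # own causal prefix
        offset += qlen

    return np.array(pos), mask
```

With a 3-token context and two 2-token questions, both questions get positions 3 and 4, and each question's tokens can see the context but not the other question, so each branch decodes exactly as it would in its own standalone prompt.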
Primary Area: generative models
Submission Number: 23475