Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We teach LLMs to generate semantically independent chunks of tokens in parallel.
Abstract: Decoding with autoregressive language models traditionally occurs sequentially, generating one token after another. Recent attempts to introduce parallelism require a pre-determined structure in the generated content, such as pattern-matching on bullet points. In this work, we present a new technique that automates parallel generation by dynamically exploiting the semantic independence of generated outputs to perform asynchronous decoding. We introduce Pasta-Lang, an annotation language with which language models initiate asynchronous decoding at inference time, and an accompanying Pasta-Lang interpreter that performs on-the-fly asynchronous decoding, effectively implementing parallel generation and speeding up inference. We also present an instruction-finetuning dataset of Pasta-Lang-annotated responses for teaching LLMs to annotate semantic independence, as well as the methodology for creating it. Our evaluation shows that using the interpreter with a Pasta-Lang-equipped model achieves significant speedups while maintaining comparable generation quality.
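To make the annotation idea concrete, the following is a minimal Python sketch of how an annotated response could be split into sequentially decoded text and independently decodable spans. The <async> tag name and the regex-based parsing are illustrative assumptions only, not the actual Pasta-Lang syntax or interpreter.

import re

# Hypothetical Pasta-Lang-style annotated output; the real tag syntax is
# defined in the paper, and these tag names are placeholders.
annotated = (
    "Here are two unrelated facts. "
    "<async>The capital of France is Paris.</async> "
    "<async>Water boils at 100 C at sea level.</async>"
)

# Split the response into sequentially decoded text and spans marked as
# semantically independent (candidates for parallel decoding).
parts = re.split(r"<async>(.*?)</async>", annotated)
sequential_text = parts[0::2]
independent_spans = parts[1::2]

print(independent_spans)
# ['The capital of France is Paris.', 'Water boils at 100 C at sea level.']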
Lay Summary: Most language models write left-to-right, even when different parts of the reply do not depend on each other. PASTA trains the model to tag those independent spans while it is composing. A lightweight interpreter reads the tags, fires off several decoding threads in parallel, and then stitches the finished chunks back into place. This parallel decoding technique largely preserves answer quality while delivering 1.2x-1.9x faster responses. Crucially, the model itself, not hand-written rules, decides what can run in parallel. The approach opens a simple route to faster text generation.
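As a rough illustration of the interpreter's role, the sketch below launches one decoding job per independent span and stitches the results back in their original order. The decode_span stub and the thread-pool scheduling are assumptions for illustration; the paper's interpreter performs actual on-the-fly asynchronous LLM decoding.

from concurrent.futures import ThreadPoolExecutor

def decode_span(prefix, hint):
    # Stand-in for a decoding call; in the real system the LLM generates
    # each independent span in its own decoding thread.
    return f"[generated text about {hint}]"

def interpret(prefix, span_hints):
    # Fire off one decoding job per semantically independent span in
    # parallel, then stitch the finished chunks back into place in order.
    with ThreadPoolExecutor() as pool:
        chunks = list(pool.map(lambda h: decode_span(prefix, h), span_hints))
    return prefix + " " + " ".join(chunks)

print(interpret("Here are two unrelated facts:", ["fact one", "fact two"]))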
Primary Area: Deep Learning->Large Language Models
Keywords: parallel decoding, large language model, inference acceleration
Submission Number: 7527