FutureFill: Fast Generation from Convolutional Sequence Models

ICLR 2026 Conference Submission 21996 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: convolutional models, fast inference
TL;DR: FutureFill is a fast autoregressive generation method for convolutional sequence prediction algorithms: it reduces generation time from quadratic to quasilinear in the context length, and is supported by theoretical results and experiments.
Abstract: We address the challenge of efficient autoregressive generation in sequence prediction models by introducing FutureFill, a general-purpose fast generation method for any sequence prediction algorithm based on convolutional operators. FutureFill reduces generation time from quadratic to quasilinear in the context length. Moreover, when generating from a prompt, it requires a prefill cache whose size grows only with the number of tokens to be generated, often much smaller than the caches required by standard convolutional or attention-based models. We validate our theoretical claims with language modeling experiments and demonstrate substantial efficiency gains when generating from a deep convolutional sequence prediction model.
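The following is a minimal NumPy sketch of the prefill idea as I read it from the abstract, not the authors' implementation: `fft_causal_conv`, `generate`, and `next_token` are hypothetical names introduced here for illustration. The sketch shows only the prompt-side "future fill" (a prefill cache whose size equals the number of tokens to be generated, independent of prompt length); contributions from already-generated tokens are folded in naively, whereas the paper's method presumably applies the same FFT trick during decoding as well to reach quasilinear total time.

```python
import numpy as np

def fft_causal_conv(k, u):
    # Linear convolution of filter k with sequence u via FFT:
    # out[p] = sum_s k[s] * u[p - s], in O(n log n) rather than O(n^2).
    n = len(k) + len(u) - 1
    return np.fft.irfft(np.fft.rfft(k, n) * np.fft.rfft(u, n), n)

def generate(k, prompt, K, next_token):
    # Toy convolutional model: y_p = sum_{s>=1} k[s] * u[p-s],
    # with the next input token given by u_p = next_token(y_p).
    T = len(prompt)
    assert len(k) > K, "sketch assumes the filter covers the generation horizon"
    # Prefill ("future fill"): the prompt's contribution to the next K
    # outputs, computed once with a single FFT convolution. The cache
    # has size K regardless of the prompt length T.
    cache = fft_causal_conv(k, np.asarray(prompt, dtype=float))[T : T + K]
    gen = []
    for t in range(K):
        # Contributions from tokens generated so far (naive O(t) sum here;
        # chunked FFT future fills would amortize this during decoding).
        inner = sum(k[t - j] * g for j, g in enumerate(gen))
        gen.append(next_token(cache[t] + inner))
    return gen

# Toy usage with random data and np.sign as a stand-in decoder:
rng = np.random.default_rng(0)
k = rng.normal(size=1024)        # convolutional filter
prompt = rng.normal(size=4096)   # observed context (floats for this toy model)
tokens = generate(k, prompt, K=512, next_token=np.sign)
```

Note that in this sketch only the prefill pass is fast; the asymptotic claim in the abstract concerns the full method, which would need the decode-time contributions handled by the same convolutional trick rather than the naive inner sum above.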
Supplementary Material: zip
Primary Area: generative models
Submission Number: 21996