Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

Published: 22 Jan 2025, Last Modified: 24 Feb 2025
Venue: ICLR 2025 Poster
License: CC BY 4.0
Keywords: computational efficiency, inference, sequence generative models, long convolution sequence models, hyena
TL;DR: We introduce a general method for speeding up exact inference in some classes of generative sequence models
Abstract: While transformers have been at the core of most recent advances in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this issue; some of them, including long convolution sequence models (LCSMs) such as Hyena, are subquadratic at training time but remain quadratic during inference. We propose a method that speeds up exact LCSM inference to quasilinear time, identify the key properties that make this possible, and propose a general framework that exploits them. Our approach, inspired by previous work on relaxed polynomial interpolation, is based on a tiling that reduces memory movement and shares computation. It has the added benefit of allowing almost complete parallelization across the layers of the position-mixing part of the architecture. Empirically, we provide a proof-of-concept implementation for Hyena, which achieves up to a $7.8\times$ end-to-end speedup over standard inference by speeding up the position-mixing part by $110\times$.
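The abstract attributes the speedup to a tiling inspired by relaxed polynomial interpolation. The sketch below is not the paper's algorithm; it is a minimal illustration, under stated assumptions, of the underlying classic idea (relaxed/online convolution via divide and conquer): when the input $x_t$ is generated autoregressively, the contribution of each completed block of inputs to future outputs is pushed in bulk with an FFT convolution, so evaluating a length-$L$ causal long convolution drops from $O(L^2)$ to roughly $O(L \log^2 L)$ work. All names here (`next_token`, the toy filter `h`) are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of relaxed (online) convolution by divide and conquer.
# Computes y_t = sum_{s<=t} h[t-s] * x[s] while x_t may depend on y_{t-1}.
import numpy as np
from scipy.signal import fftconvolve

L = 1 << 10                        # sequence length (power of two for simplicity)
h = np.random.randn(L)             # long convolution filter, as in one LCSM channel
x = np.zeros(L)                    # inputs, generated autoregressively
y = np.zeros(L)                    # outputs, accumulated block by block

def next_token(t):
    # Toy autoregressive rule: x_t may depend on y_{t-1}, which is complete by now.
    return 1.0 if t == 0 else float(np.tanh(y[t - 1]))

def solve(lo, hi):
    # Invariant: the contributions of x[:lo] to y[lo:hi] have already been added.
    if hi - lo == 1:
        x[lo] = next_token(lo)
        y[lo] += x[lo] * h[0]      # only the diagonal term was still missing
        return
    mid = (lo + hi) // 2
    solve(lo, mid)
    # Push the finished block x[lo:mid] onto y[mid:hi] with one batched
    # FFT convolution instead of token-by-token updates.
    c = fftconvolve(x[lo:mid], h[: hi - lo])
    y[mid:hi] += c[mid - lo : hi - lo]
    solve(mid, hi)

solve(0, L)

# Sanity check against the naive O(L^2) evaluation.
y_naive = np.array([np.dot(x[: t + 1], h[t::-1]) for t in range(L)])
assert np.allclose(y, y_naive)
```

Intuitively, this block structure also hints at the cross-layer parallelism mentioned in the abstract: once a block of one layer's outputs is finalized, the next layer can start consuming it while the current layer continues with later positions.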
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9297