Transformer-VQ: Linear-Time Transformers via Vector Quantization

Published: 16 Jan 2024, Last Modified: 05 Mar 2024 · ICLR 2024 poster
Keywords: Transformer, Transformer Decoder, Decoder-Only Transformer, Natural Language Processing, NLP, Vector Quantization, VQ, K-Means, Clustering, Causal Attention, Autoregressive Attention, Efficient Attention, Linear-Time Attention, Autoregressive Modeling, Generative Modeling, Gated Attention, Compressive Attention, Kernelized Attention, Kernelizable Attention, Hierarchical Attention, Segment-Level Recurrent Attention, Long-Context Modeling, Long-Range Modeling, Long-Range Dependencies, Long-Term Dependencies, Cached Attention, Shift-Equivariant Attention
TL;DR: Simple, efficient, and stable decoder-only attention via vector quantization
Abstract: We introduce Transformer-VQ, a decoder-only transformer computing softmax-based dense self-attention in linear time. Transformer-VQ's efficient attention is enabled by vector-quantized keys and a novel caching mechanism. In our large-scale experiments, Transformer-VQ is shown to be highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, over 12x faster at 32k, and can scale to 131k with similar throughput. Code available: \url{https://github.com/transformer-vq/transformer_vq}
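The sketch below illustrates, in JAX, the core idea the abstract refers to: when each key is vector-quantized against a small codebook, the softmax score matrix factors through the codebook, so values can be aggregated once per codeword rather than once per position. The function names (`vq_keys`, `vq_attention`), shapes, and scaling here are illustrative assumptions, not the authors' implementation; in particular, the causal block-wise caching mechanism that gives the paper its linear-time decoding is omitted, and no numerical-stability tricks are applied.

```python
# Minimal sketch (non-causal, no cache): softmax attention computed through
# vector-quantized keys. Names and shapes are illustrative assumptions.
import jax
import jax.numpy as jnp


def vq_keys(k, codebook):
    """Assign each key to its nearest codeword by squared Euclidean distance."""
    # k: [T, d], codebook: [S, d]
    d2 = (jnp.sum(k ** 2, axis=-1, keepdims=True)   # [T, 1]
          - 2.0 * k @ codebook.T                    # [T, S]
          + jnp.sum(codebook ** 2, axis=-1))        # [S], broadcasts to [T, S]
    return jnp.argmin(d2, axis=-1)                  # [T] codeword indices


def vq_attention(q, k, v, codebook):
    """Softmax attention with quantized keys, routed through the codebook.

    Replacing every key with one of S codewords means the score matrix
    exp(q @ k_hat.T) factors through exp(q @ codebook.T) of shape [T, S],
    so values are aggregated per codeword instead of per position.
    """
    codes = vq_keys(k, codebook)                                      # [T]
    scores = jnp.exp(q @ codebook.T / jnp.sqrt(q.shape[-1]))          # [T, S]
    onehot = jax.nn.one_hot(codes, codebook.shape[0], dtype=v.dtype)  # [T, S]
    agg_v = onehot.T @ v            # [S, d_v]: sum of values per codeword
    agg_n = onehot.sum(axis=0)      # [S]: number of keys per codeword
    num = scores @ agg_v            # [T, d_v]
    den = scores @ agg_n            # [T]
    return num / den[:, None]


# Example usage with random inputs.
key = jax.random.PRNGKey(0)
q, k, v, codebook = (jax.random.normal(subkey, shape)
                     for subkey, shape in zip(jax.random.split(key, 4),
                                              [(16, 8), (16, 8), (16, 4), (32, 8)]))
out = vq_attention(q, k, v, codebook)   # [16, 4]
```

For the full causal, cached, and trained-codebook version, see the repository linked in the abstract.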
Primary Area: generative models
Submission Number: 8575