The costly self-attention layers in modern Transformers require memory and compute that grow quadratically with sequence length. Existing approximation methods usually underperform and fail to obtain significant speedups in practice. The recently proposed Flash-Attention reduces both compute and memory through a hardware-aware implementation. Can we also achieve this through algorithmic improvements? Here we present Expert Projection Attention (EPA), a novel method that reduces both compute and memory requirements while matching the language modeling performance of baseline Transformers with the same parameter budget. EPA uses Mixture-of-Experts (MoE) layers for the value and output projections and requires 4 to 8 times fewer attention matrices than standard Transformers. Our novel attention can also be combined with MoE MLP layers, resulting in an efficient "Fast Transformer".
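To make the idea concrete, below is a minimal PyTorch sketch of attention with a reduced number of heads whose value and output projections are MoE layers, as we read it from the abstract. The class names, the per-token top-k routing with sigmoid gates, and all hyperparameters are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch (assumptions): attention with few heads, where the value and
# output projections are Mixture-of-Experts (MoE) projections. Routing scheme,
# names, and sizes are illustrative, not the paper's exact method.
import torch
import torch.nn as nn


class MoEProjection(nn.Module):
    """Token-wise mixture of linear experts: y = sum_e g_e(x) * (x @ W_e)."""

    def __init__(self, d_in, d_out, n_experts=4, k=2):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(n_experts, d_in, d_out) * d_in ** -0.5)
        self.router = nn.Linear(d_in, n_experts, bias=False)
        self.k = k

    def forward(self, x):                               # x: (B, T, d_in)
        gates = torch.sigmoid(self.router(x))           # (B, T, E) expert scores
        topv, topi = gates.topk(self.k, dim=-1)         # keep k experts per token
        # Dense compute of all experts for clarity; a real implementation
        # would dispatch tokens only to their selected experts.
        all_out = torch.einsum("btd,edf->btef", x, self.experts)  # (B, T, E, d_out)
        idx = topi.unsqueeze(-1).expand(-1, -1, -1, all_out.size(-1))
        sel = torch.gather(all_out, 2, idx)              # (B, T, k, d_out)
        return (topv.unsqueeze(-1) * sel).sum(dim=2)     # gated sum over experts


class EPAttention(nn.Module):
    """Causal attention with few heads and MoE value/output projections."""

    def __init__(self, d_model, n_heads=2, d_head=64, n_experts=4, k=2):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        # Value and output projections are MoE layers (one per head here).
        self.v_proj = nn.ModuleList(
            [MoEProjection(d_model, d_head, n_experts, k) for _ in range(n_heads)])
        self.o_proj = nn.ModuleList(
            [MoEProjection(d_head, d_model, n_experts, k) for _ in range(n_heads)])

    def forward(self, x):                                # x: (B, T, d_model)
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        out = 0.0
        for h in range(self.n_heads):                    # few heads -> few attention matrices
            v = self.v_proj[h](x)                        # MoE value projection, (B, T, d_head)
            att = (q[:, h] @ k[:, h].transpose(-2, -1)) * self.d_head ** -0.5
            att = att.masked_fill(mask, float("-inf")).softmax(dim=-1)
            out = out + self.o_proj[h](att @ v)          # MoE output projection
        return out


if __name__ == "__main__":
    x = torch.randn(2, 16, 256)
    print(EPAttention(256)(x).shape)                     # torch.Size([2, 16, 256])
```

Under these assumptions, the saving comes from computing only `n_heads` attention matrices (here 2) while the MoE projections keep the number of parameters comparable to a standard multi-head layer.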