Keywords: State Space Models, Efficient Transformers, Long Range Language Modeling, Language Modeling
TL;DR: The Block-State Transformer combines State Space Models with attention, outperforming strong baselines on long sequences while being more efficient.
Abstract: State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies, and they scale efficiently to long sequences owing to their subquadratic runtime complexity.
Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks in vision and audio; however, they still lag behind Transformers on language modeling tasks.
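For context, the subquadratic runtime follows from the standard discretized linear SSM, which admits a convolutional view. The sketch below uses common SSM notation ($\bar{A}$, $\bar{B}$, $C$, $\bar{K}$) as an assumption; it is not taken from this submission's appendix.

```latex
% A minimal sketch of the discretized linear SSM behind the subquadratic-runtime claim;
% the notation (\bar{A}, \bar{B}, C, \bar{K}) follows common SSM conventions and is assumed here.
x_k = \bar{A} x_{k-1} + \bar{B} u_k, \qquad y_k = C x_k
% Unrolling the recurrence turns the map u \mapsto y into a convolution with a fixed kernel:
y_k = \sum_{j=0}^{k} C \bar{A}^{\,j} \bar{B}\, u_{k-j} = (\bar{K} * u)_k,
\qquad \bar{K} = \big(C\bar{B},\; C\bar{A}\bar{B},\; \dots,\; C\bar{A}^{L-1}\bar{B}\big)
% so a length-L sequence can be processed with an FFT-based convolution in O(L \log L) time.
```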
In this work, we propose a hybrid layer named Block-State Transformer (*BST*) that internally combines an SSM sublayer for long-range contextualization with a Block Transformer sublayer for short-term representation of sequences.
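As a rough illustration of how such a hybrid layer can be wired together, here is a minimal JAX sketch, not the authors' implementation: it assumes a single cross-attention head, an SSM realized as an FFT convolution, and hypothetical names (`ssm_conv`, `block_state_layer`, `block_len`, the weight matrices); within-block self-attention, multi-head structure, the three variants below, residuals, and normalization are all omitted.

```python
# Minimal sketch under the assumptions stated above; NOT the paper's implementation.
import jax
import jax.numpy as jnp

def ssm_conv(u, kernel):
    """SSM sublayer as a per-channel long convolution via FFT, O(L log L).
    u: (L, D) token representations; kernel: (L, D) precomputed SSM kernel."""
    L = u.shape[0]
    n = 2 * L  # zero-pad so the circular FFT convolution acts as a linear one
    U = jnp.fft.rfft(u, n=n, axis=0)
    K = jnp.fft.rfft(kernel, n=n, axis=0)
    return jnp.fft.irfft(U * K, n=n, axis=0)[:L]

def block_state_layer(x, kernel, Wq, Wk, Wv, block_len):
    """x: (L, D), with L divisible by block_len. The SSM sublayer produces
    long-range context states; each block of the Block Transformer sublayer
    cross-attends from its tokens to the context states of the previous block."""
    L, D = x.shape
    context = ssm_conv(x, kernel)                        # (L, D) long-range states
    xb = x.reshape(L // block_len, block_len, D)         # (B, W, D) token blocks
    cb = context.reshape(L // block_len, block_len, D)   # (B, W, D) context blocks
    # Context for block i comes from block i-1 (zeros for the first block),
    # so attention never looks ahead across block boundaries.
    cb = jnp.roll(cb, shift=1, axis=0).at[0].set(0.0)

    q = xb @ Wq                                          # queries from block tokens
    k = cb @ Wk                                          # keys from SSM context states
    v = cb @ Wv                                          # values from SSM context states
    att = jax.nn.softmax(q @ k.transpose(0, 2, 1) / jnp.sqrt(D), axis=-1)
    return (att @ v).reshape(L, D)
```

Running the SSM as an FFT convolution over the full sequence is one common way to avoid block-to-block recurrence, which is what lets every block be processed in parallel.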
We study three different, and completely *parallelizable*, variants that integrate SSMs and block-wise attention.
We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences.
In addition, the Block-State Transformer demonstrates a more than *tenfold* increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.
Supplementary Material: pdf
Submission Number: 9266