Composite Slice Transformer: An Efficient Transformer with Composition of Multi-Scale Multi-Range Attentions

Mingu Lee; Saurabh Pitre; Tianyu Jiang; Pierre-David Letourneau; Matthew J Morse; Kanghwan Jang; Joseph Soriaga; Parham Noorzad; Hsin-Pai Cheng; Christopher Lott

Published: 01 Feb 2023, Last Modified: 23 Feb 2023, Venue: ICLR 2023 poster, Readers: Everyone

Keywords: transformer, efficient transformer, efficient attention

TL;DR: We propose an efficient Transformer based on a composition of multi-scale attentions over a stacked slice representation and show that it outperforms state-of-the-art efficient Transformers on multiple benchmarks.

Abstract: Since the introduction of Transformers, researchers have tackled the notoriously expensive quadratic complexity of attention. While significant improvements in computational efficiency have been achieved, they have come at the cost of reduced accuracy. In this paper, we propose Composite Slice Transformer (CST), a Transformer-based network equipped with a composition of multi-scale multi-range attentions, boosting both efficiency and modeling capability. After stacking fixed-length slices of the input sequence, each layer in CST sequentially performs a pair of attentions, one fine-grained with short range and one coarse-grained with long range, coupled with volatile instant positional embedding, enabling efficient token interactions and improving the expressiveness of the model. In addition to a significantly reduced complexity of $O(NL+N^2/L^2)$ for sequence length $N$ and slice length $L$, CST achieves superior performance on a variety of tasks. We show that CST surpasses recently published efficient Transformers on the Long Range Arena benchmark, demonstrating the bidirectional long-range dependency modeling capability of our model. It outperforms the standard Transformer by a margin of 6.9% in average accuracy across the five classification tasks of the benchmark, with complexity comparable to that of other efficient Transformers. Furthermore, on the word-level autoregressive language modeling task with the WikiText-103 dataset, CST performs competitively against the Transformer model, with only a 2% gap in test perplexity, while outperforming other efficient Transformers.
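
A rough way to see where the $O(NL+N^2/L^2)$ figure could come from is to count attention score entries. The sketch below assumes the first term is attention restricted to each length-$L$ slice and the second is attention among one summary per slice; this breakdown is an interpretation of the abstract, not something it states, and the function names are hypothetical.

```python
def full_attention_cost(n: int) -> int:
    """Number of pairwise scores in standard attention: N^2."""
    return n * n


def composite_slice_cost(n: int, slice_len: int) -> int:
    """Score count under the assumed local + global breakdown.

    Assumption (not taken from the paper): local attention runs inside
    each slice, global attention runs over one summary per slice.
    """
    if slice_len <= 0 or n % slice_len != 0:
        raise ValueError("n must be a positive multiple of slice_len")
    num_slices = n // slice_len
    local = num_slices * slice_len * slice_len   # (N/L) * L^2 = N * L
    global_ = num_slices * num_slices            # (N/L)^2 = N^2 / L^2
    return local + global_


if __name__ == "__main__":
    n, slice_len = 4096, 64
    print(full_attention_cost(n))              # 16777216
    print(composite_slice_cost(n, slice_len))  # 266240
```

With $N = 4096$ and $L = 64$, the count drops from 16,777,216 to 266,240 under these assumptions, roughly a 63-fold reduction. The count ignores constant factors, the hidden dimension, and any other per-layer work.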

Area: Deep Learning and representational learning
