Transformer-QL: A Step Towards Making Transformer Network Quadratically Large

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: deep learning, language model, transformer network, multi-scale transformer network, natural language processing, transformer-xl
Abstract: Transformer networks have shown outstanding performance on many natural language processing tasks. However, the context length (the number of previous tokens on which the output states depend) of a Transformer network grows at best linearly with the memory and computational power used. This limitation prevents a Transformer network from having a very long context in a resource-limited application. In this work, we propose a class of transformer networks, namely Transformer-QL (Quadratically Large), in which the context length can grow at best quadratically with the memory and computational power used. We have empirically evaluated a Transformer-QL model on three long-range language modeling datasets. The results show that Transformer-QL can provide significant improvements over other state-of-the-art networks.
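As a rough, hedged illustration of the scaling claim in the abstract (the symbols R and C below are introduced here for exposition and are not the paper's own notation): writing R for the memory/computational budget and C(R) for the achievable context length,

% Hedged sketch of the abstract's scaling claim; R, C_baseline, C_QL are
% illustrative symbols only, not notation taken from the paper.
\[
  C_{\text{baseline}}(R) \in O(R)
  \qquad \text{vs.} \qquad
  C_{\text{QL}}(R) \in O(R^{2}),
\]

i.e. a conventional (Transformer-XL-style) model's context can at best grow linearly in the budget R, whereas Transformer-QL is claimed to admit a context that grows at best quadratically in R.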
One-sentence Summary: In this paper, we propose a novel transformer architecture in which the context length (the number of past tokens on which the output states depend) can grow at best quadratically with memory and computational usage.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Reviewed Version (pdf): https://openreview.net/references/pdf?id=phlH-YRmsC