Transformer-QL: A Step Towards Making Transformer Network Quadratically Large

Sep 28, 2020 (edited Mar 05, 2021) · ICLR 2021 Conference Blind Submission · Readers: Everyone
  • Reviewed Version (pdf): https://openreview.net/references/pdf?id=phlH-YRmsC
  • Keywords: deep learning, language model, transformer network, multi-scale transformer network, natural language processing, transformer-xl
  • Abstract: Transformer networks have shown outstanding performance on many natural language processing tasks. However, the context length (the number of previous tokens on which the output states depend) of a Transformer network grows at best linearly with the memory and computational power used. This limitation prevents a Transformer network from having a very long context in resource-limited applications. In this work, we propose a class of transformer networks, namely Transformer-QL (Quadratically Large), in which the context length can grow at best quadratically with the memory and computational power used. We empirically evaluate a Transformer-QL model on three long-range language modeling datasets. The results show that Transformer-QL can provide significant improvements over other state-of-the-art networks.
  • One-sentence Summary: In this paper, we propose a novel transformer architecture in which the context length (the number of past tokens on which the output states depend) can grow at best quadratically with the memory and computational resources used.
  • Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
  • Supplementary Material: zip
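
A minimal sketch of the scaling relationship stated in the abstract, written as a LaTeX display for clarity; the symbols M (memory and compute budget) and C (effective context length) are assumptions introduced here for illustration and are not notation from the submission itself:

% Illustrative restatement of the abstract's scaling claim.
% M and C are hypothetical symbols: M = memory/compute budget, C = context length.
\begin{align*}
  C_{\text{Transformer / Transformer-XL}} &= O(M)   && \text{context grows at best linearly with the budget}\\
  C_{\text{Transformer-QL}}               &= O(M^2)  && \text{context grows at best quadratically with the budget}
\end{align*}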