Reducing Transformer Depth on Demand with Structured Dropout

Published: 20 Dec 2019, Last Modified: 03 Apr 2024 | ICLR 2020 Conference Blind Submission | Readers: Everyone
TL;DR: LayerDrop is a form of structured dropout that lets you train a single model and prune it to any desired depth at test time. It can also be used to train even deeper models.
Abstract: Overparametrized transformer networks have obtained state-of-the-art results in various natural language processing tasks, such as machine translation, language modeling, and question answering. These models contain hundreds of millions of parameters, necessitating a large amount of computation and making them prone to overfitting. In this work, we explore LayerDrop, a form of structured dropout, which has a regularization effect during training and allows for efficient pruning at inference time. In particular, we show that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance. We demonstrate the effectiveness of our approach by improving the state of the art on machine translation, language modeling, summarization, question answering, and language understanding benchmarks. Moreover, we show that our approach leads to small BERT-like models of higher quality than when training from scratch or using distillation.
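The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy scalar "layers", the names `make_layer`, `forward`, and `prune_evenly`, and the evenly spaced pruning rule are all assumptions made for illustration. During training, each layer is skipped as a whole with some probability; at inference, a shallower sub-network is selected from the same stack without finetuning.

```python
import random
from typing import Callable, List, Optional, Sequence

# A toy "layer" maps a scalar to a scalar. Real transformer layers operate on
# tensors; scalars are used here only to keep the sketch dependency-free.
Layer = Callable[[float], float]


def make_layer(delta: float) -> Layer:
    """Hypothetical residual layer: output = input + delta."""
    return lambda x: x + delta


def forward(
    x: float,
    layers: Sequence[Layer],
    drop_prob: float = 0.0,
    training: bool = False,
    rng: Optional[random.Random] = None,
) -> float:
    """Run the stack, skipping whole layers at random during training."""
    draw = (rng or random).random
    for layer in layers:
        # Structured dropout: the entire layer is dropped, so the input
        # passes through unchanged (identity path).
        if training and draw() < drop_prob:
            continue
        x = layer(x)
    return x


def prune_evenly(layers: Sequence[Layer], keep: int) -> List[Layer]:
    """Keep `keep` evenly spaced layers (an assumed selection rule)."""
    n = len(layers)
    if not 1 <= keep <= n:
        raise ValueError("keep must be between 1 and the number of layers")
    indices = [(i * n) // keep for i in range(keep)]
    return [layers[i] for i in indices]


layers = [make_layer(1.0) for _ in range(6)]

full_output = forward(0.0, layers)                      # all 6 layers applied
shallow_output = forward(0.0, prune_evenly(layers, 3))  # 3-layer sub-network
noisy_output = forward(0.0, layers, drop_prob=0.5, training=True)
```

In this toy setting the pruned network simply applies fewer increments. In a trained model, the point is that exposure to randomly missing layers during training is what makes the shallower sub-network usable without finetuning.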
Keywords: reduction, regularization, pruning, dropout, transformer
Code: 5 community implementations (Papers with Code)
Data: ELI5, GLUE, MRPC, MultiNLI, QNLI, SST, SST-2, WikiText-103, WikiText-2
Community Implementations: 3 code implementations (CatalyzeX)
18 Replies