Efficient transfer learning for NLP with ELECTRA

22 Jan 2021 (modified: 05 May 2023) · ML Reproducibility Challenge 2020 Blind Submission
Keywords: ML, NLP, ELECTRA
TL;DR: Reproducibility of ELECTRA
Abstract:
Scope of Reproducibility: Clark et al. [2020] claim that the ELECTRA approach is highly efficient in NLP performance relative to computation budget. This study focuses on that claim, summarized by the following question: can ELECTRA achieve close-to-SOTA performance for NLP in low-resource settings, in terms of compute cost?

Methodology: This replication study was conducted by fully reimplementing the small variant of the original ELECTRA model (Clark et al. [2020]). All experiments were performed on single-GPU computers. Models were evaluated on the GLUE benchmark dev set (Wang et al. [2018]) and compared with the original paper.

Results: My results are similar to those of the original ELECTRA implementation (Clark et al. [2020]), despite minor differences with the original paper for both implementations. With only 14M parameters, ELECTRA outperforms, in absolute performance, concurrent pretraining approaches from some previous SOTA models, such as GPT, as well as alternative efficient approaches based on knowledge distillation, such as DistilBERT. Once compute cost is taken into account, ELECTRA clearly outperforms all compared approaches, including BERT and TinyBERT. This work therefore supports the claim that ELECTRA achieves a high level of performance in low-resource settings, in terms of compute cost. Furthermore, with a generator capacity larger than recommended by Clark et al. [2020], the discriminator can collapse, becoming unable to distinguish whether inputs are fake or not. Thus, while ELECTRA is easier to train than a GAN (Goodfellow et al. [2014]), it appears to be sensitive to the capacity allocation between generator and discriminator. The code and a pretrained model will be released.

What was easy: The information provided by the authors of the original paper (Clark et al. [2020]), whether in the paper itself, within the source code, or in the official GitHub repository, is rich and exhaustive enough to understand the proposed approach. In addition, as stated in their main claim, ELECTRA can easily be run on a single GPU, even with less GPU memory than the machines used in the original paper.

What was difficult: Because it aggregates several tasks, and because of the variance in results, the GLUE benchmark requires a significant amount of effort, in terms of both implementation and computation. When comparing models, several tricks can also influence the results, which is further amplified by the different aggregation formulas and the lack of dispersion measures for published results. As such, confirming the correctness of this reimplementation was harder than expected.

Communication with original authors: Kevin Clark, one of the original authors, was helpful in answering some questions. Unfortunately, a breakdown of the GLUE score per task has not yet been provided, which would allow a full comparison of this implementation with the original one. Otherwise, most of my questions were already answered through the GitHub repository or by inspecting the source code.
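The capacity-allocation issue discussed above stems from ELECTRA's pretraining objective: a small generator proposes replacement tokens, and the discriminator must label every position as original or replaced. The following is a minimal, hypothetical sketch of that corruption step in pure Python (the actual model uses transformer networks for both components; `corrupt`, `mask_prob`, and the toy vocabulary are illustrative assumptions, not the authors' code):

```python
import random

def corrupt(tokens, vocab, mask_prob=0.15, rng=random):
    """Toy sketch of ELECTRA-style corruption: replace roughly
    mask_prob of the tokens with samples from a (here, trivial)
    generator, and return binary 'is-replaced' labels for the
    discriminator. Hypothetical simplification for illustration."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            sampled = rng.choice(vocab)          # generator's proposal
            corrupted.append(sampled)
            labels.append(int(sampled != tok))   # replaced only if it differs
        else:
            corrupted.append(tok)
            labels.append(0)                     # original token kept
    return corrupted, labels

random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = corrupt(tokens, vocab)
# The discriminator is trained on (corrupted, labels) over *all*
# positions, not just the masked ones, which is the intuition behind
# ELECTRA's sample efficiency relative to masked language modeling.
```

If the generator is too strong relative to the discriminator, its replacements become indistinguishable from the originals, and the discriminator's task degenerates, which is consistent with the collapse behavior reported above.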
Paper Url: https://openreview.net/forum?id=r1xMH1BtvB