How fine can fine-tuning be? Learning efficient language modelsDownload PDF

25 Sep 2019 (modified: 24 Dec 2019)ICLR 2020 Conference Withdrawn SubmissionReaders: Everyone
  • Original Pdf: pdf
  • TL;DR: Sparsification as fine-tuning of language models
  • Abstract: State-of-the-art performances on language comprehension tasks are achieved by huge language models pre-trained on massive unlabeled text corpora, with very light subsequent fine-tuning in a task-specific supervised manner. It seems the pre-training procedure learns a very good common initialization for further training on various natural language understanding tasks, such that only few steps need to be taken in the parameter space to learn each task. In this work, using Bidirectional Encoder Representations from Transformers (BERT) as an example, we verify this hypothesis by showing that task-specific fine-tuned language models are highly close in parameter space to the pre-trained one. Taking advantage of such observations, we further show that the fine-tuned versions of these huge models, having on the order of $10^8$ floating-point parameters, can be made very computationally efficient. First, fine-tuning only a fraction of critical layers suffices. Second, fine-tuning can be adequately performed by learning a binary multiplicative mask on pre-trained weights, \textit{i.e.} by parameter-sparsification. As a result, with a single effort, we achieve three desired outcomes: (1) learning to perform specific tasks, (2) saving memory by storing only binary masks of certain layers for each task, and (3) saving compute on appropriate hardware by performing sparse operations with model parameters.
  • Keywords: language model, BERT, pre-trained, fine-tuning, sparse
1 Reply