- Keywords: fine-tuning, information bottleneck, large-scale language models
- TL;DR: We propose using the information bottleneck principle to reduce overfitting when fine-tuning large-scale language models on low-resource datasets.
- Abstract: Large-scale pre-trained language models act as general-purpose feature extractors, but not all of their features are relevant to a given target task. This can cause problems in low-resource scenarios, where fine-tuning such large-scale models often overfits the small training set. We propose to use the information bottleneck principle to improve generalization in this scenario. We apply the variational information bottleneck method to remove task-irrelevant and redundant features from sentence embeddings during the fine-tuning of BERT. Evaluation on seven low-resource datasets covering different tasks shows that our method significantly improves transfer learning in low-resource scenarios and obtains better generalization on 11 out of 13 out-of-domain textual entailment datasets.
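
The variational information bottleneck idea described in the abstract can be sketched as a small stochastic layer inserted between the encoder's sentence embedding and the task classifier. The following is a minimal PyTorch sketch, not the paper's implementation; the class name `VIBLayer`, the dimensions, and the `beta` value are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class VIBLayer(nn.Module):
    """Illustrative variational information bottleneck over a sentence embedding.

    Maps an input embedding x to a stochastic code z ~ N(mu(x), sigma(x)^2)
    and penalizes the KL divergence to a standard normal prior, which
    pressures the code to discard task-irrelevant features.
    """

    def __init__(self, in_dim: int, bottleneck_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, bottleneck_dim)
        self.log_var = nn.Linear(in_dim, bottleneck_dim)

    def forward(self, x):
        mu, log_var = self.mu(x), self.log_var(x)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
        # keeps sampling differentiable for end-to-end fine-tuning.
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        # Closed-form KL(N(mu, sigma^2) || N(0, I)), averaged over the batch.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=-1).mean()
        return z, kl

# Hypothetical usage: compress a 768-d BERT sentence embedding to 128 dims.
vib = VIBLayer(768, 128)
x = torch.randn(4, 768)   # stand-in for a batch of BERT sentence embeddings
z, kl = vib(x)
beta = 1e-3               # bottleneck weight (illustrative value)
# During fine-tuning, the objective would combine the task loss with the
# KL penalty, e.g.:  total_loss = task_loss + beta * kl
```

During training, `z` replaces the original embedding as the classifier input, so the KL term and the task loss jointly decide which features survive the bottleneck.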