Embedding Recycling for Language Models

TMLR Paper 330 Authors

02 Aug 2022 (modified: 28 Feb 2023) · Rejected by TMLR
Abstract: Training and inference with large neural models is expensive. However, for many application domains, while new tasks and models arise frequently, the underlying documents being modeled remain mostly unchanged. We study how to decrease computational cost in such settings through embedding recycling (ER): re-using activations from previous model runs during training or inference. In contrast to prior work that fine-tunes only small classification heads over frozen representations, which often leads to notable drops in accuracy, we propose caching an intermediate layer’s output from a pretrained model and fine-tuning the remaining layers for new tasks. We show that our method is effective using either fine-tuning for the trainable layers, or parameter-efficient adapters. For the best-performing model in our experiments, DeBERTa-v2 XL with adapters, we find that our method provides a 100% speedup during training and an 87-91% speedup for inference, with negligible impact on accuracy averaged across eight tasks spanning text classification and entity recognition in the scientific domain and general-domain question answering. Further, in experiments with SciBERT, BERT-base, and RoBERTa-large, we show a 100% speedup during training and a 55-86% speedup for inference, at only a 0.19-0.23% reduction in accuracy on average. Finally, we identify several open challenges and future directions for ER.
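To make the idea concrete, below is a minimal sketch of embedding recycling with a BERT-style encoder, assuming the HuggingFace transformers library. The layer index K, the cache dictionary, and the upper_layers_forward helper are illustrative choices for exposition, not the authors' exact implementation: the corpus is encoded once, layer-K activations are cached, and only the layers above K (plus a task head) would receive gradients when fine-tuning.

```python
# Sketch of embedding recycling (ER): cache layer-K activations once,
# then train/infer using only the remaining encoder layers.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "bert-base-uncased"   # any BERT/RoBERTa-style checkpoint (illustrative)
K = 6                         # cache output of encoder layer K; reuse layers K.. onward

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

@torch.no_grad()
def cache_activations(texts):
    """One-time pass over the (mostly unchanged) corpus: store layer-K hidden states."""
    cache = {}
    for i, text in enumerate(texts):
        enc = tokenizer(text, return_tensors="pt", truncation=True)
        out = model(**enc)
        # hidden_states[0] is the embedding output, so index K is the K-th encoder layer.
        cache[i] = (out.hidden_states[K], enc["attention_mask"])
    return cache

def upper_layers_forward(hidden, attention_mask):
    """Reuse cached layer-K activations and run only the remaining layers."""
    # BERT-style extended mask: shape (batch, 1, 1, seq_len) with large negatives on padding.
    ext_mask = model.get_extended_attention_mask(attention_mask, hidden.shape[:2])
    for layer in model.encoder.layer[K:]:
        hidden = layer(hidden, attention_mask=ext_mask)[0]
    return hidden  # feed into a task-specific head; only these layers and the head are trained
```

In this setup the lower K layers never need to be re-run for new tasks on the same documents, which is the source of the training and inference speedups reported above; adapters can be inserted into the upper layers instead of full fine-tuning for a parameter-efficient variant.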
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: N/A
Assigned Action Editor: ~Alessandro_Sordoni1
Submission Number: 330