Parameter and Memory Efficient Language Model Compression using Fisher Information from Low-Rank Representations

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Modern language models demonstrate excellent performance on diverse text processing tasks. Yet, to achieve the best quality, memory- and compute-intensive fine-tuning on a downstream task is required. While PEFT methods such as LoRA add almost no VRAM overhead to fine-tuning, the amount of memory and compute may still be prohibitive for regular users. To compress and speed up LMs, pruning techniques such as Fisher-Weighted Singular Value Decomposition (FWSVD) (https://arxiv.org/abs/2207.00112) are therefore additionally used. Yet, FWSVD itself requires downstream-task fine-tuning to gather Fisher information. Our work breaks this vicious circle of dependence on large, expensive GPUs by showing that state-of-the-art LM compression, such as FWSVD, can be done without storing the full gradients. Specifically, our approach reduces the number of trainable parameters to as little as $0.01\%$ of the original parameter count and VRAM utilization to as little as 15\%, while pruning $20\%$ of the fine-tuned model weights without any noticeable loss of accuracy. We evaluate this approach on various tasks, including NLU, NER, MMLU, and summarization, demonstrating its effectiveness compared to strong baselines.
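To make the idea concrete, below is a minimal, hedged sketch of how Fisher-weighted SVD compression of a single linear layer might be combined with LoRA-style low-rank gradients, so that full-weight gradients never need to be stored. The Fisher estimator `grad_B @ A` (via the chain rule for the effective weight W + BA) and all function names are illustrative assumptions, not necessarily the estimator or code used in the paper.

```python
# Hedged sketch: FWSVD-style compression of one linear layer, with the
# per-row Fisher information approximated from LoRA-factor gradients
# rather than full-weight gradients (an illustrative assumption).
import torch

def fisher_from_lora(grad_B: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """Approximate per-row Fisher information of the frozen weight W
    from the accumulated gradient of the LoRA 'B' factor and the 'A' factor."""
    # dL/dW ~= dL/d(BA) = grad_B @ A   (shape: out_features x in_features)
    approx_grad_W = grad_B @ A
    # Empirical Fisher: squared gradients, summed over the input dimension.
    return (approx_grad_W ** 2).sum(dim=1)              # (out_features,)

def fisher_weighted_svd(W: torch.Tensor, fisher_rows: torch.Tensor, rank: int):
    """FWSVD-style factorization: truncated SVD of the row-reweighted matrix."""
    d = torch.sqrt(fisher_rows + 1e-8)                  # row importance weights
    U, S, Vh = torch.linalg.svd(torch.diag(d) @ W, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]
    # Undo the reweighting on the left factor so that L @ R approximates W.
    L = torch.diag(1.0 / d) @ U @ torch.diag(S.sqrt())
    R = torch.diag(S.sqrt()) @ Vh
    return L, R

# Toy usage: a 512x512 layer factorized to rank 64 (~25% of the parameters).
W = torch.randn(512, 512)                               # frozen pretrained weight
A = torch.randn(8, 512) * 0.01                          # LoRA rank-8 factor
grad_B = torch.randn(512, 8)                            # gradient gathered during PEFT
L, R = fisher_weighted_svd(W, fisher_from_lora(grad_B, A), rank=64)
print((L @ R - W).norm() / W.norm())                    # relative reconstruction error
```

Under these assumptions, only the small LoRA factors and their gradients are kept in memory, which is what allows the Fisher weighting to be estimated without the VRAM cost of full-gradient accumulation.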
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings/efficiency
Languages Studied: English