Abstract: As the size of language models increases, they deliver substantial performance improvements across a variety of applications. However, this growth also leads to greater computational demands, making deployment on resource-constrained devices—such as personal computers and mobile or wearable devices—more challenging, and significantly raising inference costs on cloud servers. To address these challenges, we introduce Basel, a method to streamline language models by leveraging the semantic structure of their weight matrices. Specifically, Basel treats each weight matrix as a linear combination of bases, selectively retaining those that are associated with essential semantics for the target application, pruning redundant ones, and introducing new bases that enhance task performance. Experimental results demonstrate that Basel achieves significant model size reduction compared to baseline techniques, while maintaining comparable or even superior accuracy across diverse applications.
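To make the basis-selection idea in the abstract concrete, below is a minimal sketch (not the authors' implementation) of viewing a weight matrix as a linear combination of rank-1 bases via SVD, retaining only the top-scoring bases, pruning the rest, and appending a few new bases. The relevance score used here (singular value magnitude) and the random initialization of the new bases are placeholder assumptions; Basel's actual, task-dependent selection and training procedure is described in the paper.

```python
# Sketch only: SVD-based basis selection with a few freshly added bases.
import torch

def compress_weight(W: torch.Tensor, keep: int, new: int) -> torch.Tensor:
    """Rebuild W from a reduced set of retained bases plus `new` added ones."""
    # Express W as a sum of rank-1 bases: W = sum_i s_i * u_i v_i^T.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)

    # Placeholder relevance score: singular value magnitude (an assumption,
    # standing in for a task-specific importance measure).
    kept = torch.topk(S, k=keep).indices
    U_k, S_k, Vh_k = U[:, kept], S[kept], Vh[kept, :]

    # Introduce `new` extra bases, randomly initialized here; in practice
    # these would be trained on the target task to recover accuracy.
    U_new = torch.randn(W.shape[0], new) * 1e-3
    Vh_new = torch.randn(new, W.shape[1]) * 1e-3
    S_new = torch.full((new,), 1e-3)

    U_all = torch.cat([U_k, U_new], dim=1)
    S_all = torch.cat([S_k, S_new])
    Vh_all = torch.cat([Vh_k, Vh_new], dim=0)

    # Reconstruct the compressed weight from retained + new bases.
    # Storing the factors (U_all, S_all, Vh_all) instead of W is where the
    # parameter savings come from when keep + new << min(W.shape).
    return U_all @ torch.diag(S_all) @ Vh_all

# Example: keep 64 bases of a 512x512 weight and add 8 new ones.
W = torch.randn(512, 512)
W_small = compress_weight(W, keep=64, new=8)
```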
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We sincerely thank the editor for overseeing the review process and the reviewers for their thorough feedback. In this revised version of the paper, we have carefully addressed all requested changes, as detailed in the responses below.
The major updates in this version are as follows:
- Weakened claims regarding the interpretation of semantic bases.
- Added a new task (WikiText) to broaden evaluation.
- Introduced a new low-rank compression baseline (SVD-LLM).
- Added a new pruning baseline (Wanda) and compared Basel with pruning methods under both quantized and unquantized settings.
- Provided the hyperparameters used in all experiments.
- Included an additional ablation study on the effect of L1 regularization.
Additional changes are also highlighted in our point-by-point response to the reviewers.
Assigned Action Editor: ~Quanquan_Gu1
Submission Number: 5486