Abstract: As the size of language models increases, they deliver substantial performance improvements across a variety of applications. However, this growth also leads to greater computational demands, making deployment on resource-constrained devices—such as personal computers and mobile or wearable devices—more challenging, and significantly raising inference costs on cloud servers. To address these challenges, we introduce Basel, a method to streamline language models by leveraging the semantic structure of their weight matrices. Our analysis reveals that the bases of these weight matrices encode distinct semantic components, some of which are redundant for specific target applications. Our approach identifies and removes these redundant bases, retaining only those carrying essential semantics, and introduces new bases that enhance performance for the target tasks. Evaluations show that our method achieves up to 2.7× greater model size reduction compared to state-of-the-art techniques while maintaining similar or superior accuracy across diverse applications.
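The abstract only summarizes the approach, so as a rough illustration (not the authors' implementation) here is a minimal sketch, assuming the "bases" are singular vectors of a weight matrix and that `importance` is a hypothetical task-derived score; the paper's actual selection criterion and the step that introduces new task-specific bases are not reproduced here.

```python
import numpy as np

def prune_bases(W, importance, keep_ratio=0.5):
    """Keep only the highest-scoring basis pairs of W.

    W: (m, n) weight matrix.
    importance: per-basis scores, e.g. derived from activations on the
        target task (a hypothetical stand-in for the paper's criterion).
    """
    # Decompose W into rank-1 components (bases).
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    k = max(1, int(keep_ratio * len(S)))
    keep = np.argsort(importance)[::-1][:k]  # indices of bases to retain
    # Reconstruct W from the retained bases only.
    return (U[:, keep] * S[keep]) @ Vt[keep, :]

# Usage with synthetic data (scores here are random placeholders):
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
scores = rng.random(min(W.shape))
W_small = prune_bases(W, scores, keep_ratio=0.25)
```

Storing the retained factors separately, rather than the reconstructed matrix, is what would yield the size reduction claimed in the abstract.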
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Quanquan_Gu1
Submission Number: 5486