Abstract: As language models grow in size, they deliver substantial performance improvements across a variety of applications. However, this growth also increases computational demands, making deployment on resource-constrained devices, such as personal computers and mobile or wearable devices, more challenging, and significantly raising inference costs on cloud servers. To address these challenges, we introduce Basel, a method that streamlines language models by leveraging the semantic structure of their weight matrices. Specifically, Basel treats each weight matrix as a linear combination of bases, selectively retaining those associated with semantics essential to the target application, pruning redundant ones, and introducing new bases that enhance task performance. Experimental results demonstrate that Basel achieves significantly greater model size reduction than baseline techniques while maintaining comparable or even superior accuracy across diverse applications.
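As a rough illustration of the pipeline the abstract describes (decompose a weight matrix into bases, retain the essential ones, prune the rest, and add new bases), the Python sketch below shows one plausible reading. It assumes SVD-derived rank-1 bases and a hypothetical per-basis `importance` score; Basel's actual basis construction and selection criteria are those in the paper and the linked code, not necessarily these.

```python
import torch

def decompose_into_bases(W):
    # View W as a linear combination of rank-1 bases via SVD:
    # W = sum_i S[i] * U[:, i] @ Vh[i, :]. (SVD is an assumption here;
    # the paper may construct its bases differently.)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U, S, Vh

def prune_bases(U, S, Vh, importance, keep_ratio=0.25):
    # Keep only the bases judged essential for the target application.
    # `importance` is a hypothetical per-basis score, e.g. computed on
    # task-specific calibration data rather than singular values alone.
    k = max(1, int(keep_ratio * S.numel()))
    idx = torch.topk(importance, k).indices
    return U[:, idx], S[idx], Vh[idx, :]

def add_new_bases(U, S, Vh, num_new):
    # Append fresh, randomly initialized bases intended to be fine-tuned
    # so they can recover or enhance task performance after pruning.
    m, n = U.shape[0], Vh.shape[1]
    U_new = torch.randn(m, num_new) * 0.01
    V_new = torch.randn(num_new, n) * 0.01
    S_new = torch.ones(num_new)
    return (torch.cat([U, U_new], dim=1),
            torch.cat([S, S_new]),
            torch.cat([Vh, V_new], dim=0))

def reconstruct(U, S, Vh):
    # Low-rank reconstruction of the streamlined weight matrix.
    return U @ torch.diag(S) @ Vh

# Usage on a toy weight matrix. Here `importance` is just the singular
# values as a placeholder; a task-aware score would replace it.
W = torch.randn(512, 512)
U, S, Vh = decompose_into_bases(W)
U, S, Vh = prune_bases(U, S, Vh, importance=S)
U, S, Vh = add_new_bases(U, S, Vh, num_new=8)
W_small = reconstruct(U, S, Vh)
```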
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Switched to the camera-ready template.
Added a link to the open-source code (at the end of the introduction section).
Code: https://github.com/Iowa-State-University-AI-System-Group/Basel
Assigned Action Editor: ~Quanquan_Gu1
Submission Number: 5486