Leveraging Low-Rank Structure for Effective Weight-Sharing in Language Models

Published: 02 Mar 2026, Last Modified: 02 Mar 2026, Sci4DL 2026, License: CC BY 4.0
Keywords: Language Models, Weight Sharing, LoRA
Abstract: Small language models are often built by scaling down standard large language model architectures. We argue that this design choice is suboptimal, and that small models can be parameterized more effectively by sharing weights across layers, with per-layer differences captured by low-rank adaptation (LoRA) modules. We test this hypothesis by comparing several weight-tying strategies. We find that attention matrices, and even entire layers, can be shared without degrading performance. This increases FLOPs per parameter, reduces optimizer-state memory, and outperforms parameter-matched baselines. We also reduce the parameter count of the embedding layer via a factorized construction, which yields additional memory savings. To motivate these design choices, we analyze the effective rank of model weights and the residual stream. Our analysis leads to more efficient compact models.
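The parameterization described in the abstract can be illustrated with a minimal numpy sketch (illustrative shapes and names, not the authors' implementation): one weight matrix is stored once and shared across layers, each layer adds its own low-rank LoRA delta, and the embedding table is factorized through a narrow bottleneck.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_layers, r = 64, 4, 8  # hidden size, number of shared layers, LoRA rank

# One weight matrix shared by all layers (stored once).
W_shared = rng.standard_normal((d, d)) / np.sqrt(d)

# Per-layer low-rank LoRA factors: each layer adds only 2*d*r parameters.
lora_A = [rng.standard_normal((d, r)) / np.sqrt(d) for _ in range(n_layers)]
lora_B = [np.zeros((r, d)) for _ in range(n_layers)]  # zero-init, as in LoRA

def layer_forward(x, i):
    """Effective weight of layer i is W_shared + lora_A[i] @ lora_B[i]."""
    return x @ W_shared + (x @ lora_A[i]) @ lora_B[i]

# Factorized embedding: vocab -> r_emb -> d instead of a full vocab x d table.
vocab, r_emb = 1000, 16
E1 = rng.standard_normal((vocab, r_emb))
E2 = rng.standard_normal((r_emb, d))

def embed(token_ids):
    return E1[token_ids] @ E2

# Parameter counts: sharing + LoRA vs. n_layers independent dense layers.
shared_params = d * d + n_layers * 2 * d * r   # 4096 + 4096 = 8192
dense_params = n_layers * d * d                # 16384
```

With these toy shapes, sharing halves the layer parameter count while keeping per-layer expressivity through the LoRA deltas; because the LoRA `B` factors are zero-initialized, every shared layer starts out computing exactly `x @ W_shared`.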
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Challenge: This submission is an entry to the science of DL improvement challenge.
Submission Number: 88