Leveraging Low-Rank Structure for Effective Weight-Sharing in Language Models

Published: 02 Mar 2026, Last Modified: 26 Apr 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: Language Models, Weight Sharing, LoRA
Abstract: Small language models are typically built by heuristically scaling down the architectures of large language models. We investigate whether small models can be parameterized more effectively by sharing weights across attention heads and transformer layers while capturing their differences with low-rank adaptation modules. To understand the limits and tradeoffs of this approach, we conduct controlled pretraining experiments that compare several weight-sharing strategies under strict parameter-matched constraints across four model scales from 100M to 1B parameters. We find that attention matrices, and even entire transformer layers, can be shared without degrading performance, though overly aggressive sharing configurations yield diminishing or negative returns. Within the effective sharing regime, weight sharing deliberately trades increased FLOPs per parameter for a reduced memory footprint, matching or improving over parameter-matched unshared baselines. We also explore reducing the parameter cost of the embedding layer through a factorized construction, which yields additional memory savings and enables more effective parameter allocation. To motivate these design choices, we analyze the effective rank of model weights and the residual stream. Our analysis, along with downstream evaluations, provides a recipe for designing more efficient compact models.
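To make the core idea concrete, below is a minimal PyTorch sketch (not the authors' implementation) of the two ingredients the abstract describes: a dense projection whose weight is tied across transformer layers, with a small per-layer low-rank (LoRA-style) correction capturing layer differences, and an embedding table factorized through a narrow bottleneck. All class names, dimensions, and the rank/bottleneck values are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SharedLinearWithLoRA(nn.Module):
    """Projection whose dense weight is shared across layers; each layer
    only owns a low-rank delta (LoRA-style) on top of the shared matrix.
    Illustrative sketch, not the paper's exact parameterization."""

    def __init__(self, shared_weight: nn.Parameter, rank: int = 8):
        super().__init__()
        out_dim, in_dim = shared_weight.shape
        self.shared_weight = shared_weight  # tied across all layers
        # Per-layer low-rank factors: B @ A has shape (out_dim, in_dim)
        self.lora_a = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_dim, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shared projection plus a layer-specific low-rank correction.
        weight = self.shared_weight + self.lora_b @ self.lora_a
        return x @ weight.T


class FactorizedEmbedding(nn.Module):
    """Embedding factorized as (vocab x bottleneck) followed by
    (bottleneck x d_model), reducing the embedding parameter cost."""

    def __init__(self, vocab_size: int, d_model: int, bottleneck: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, bottleneck)
        self.up = nn.Linear(bottleneck, d_model, bias=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.up(self.embed(token_ids))


if __name__ == "__main__":
    d_model, n_layers, rank = 256, 4, 8
    # One dense attention projection shared by every layer; each layer adds
    # only ~2 * rank * d_model extra parameters for its LoRA correction.
    shared_q = nn.Parameter(torch.randn(d_model, d_model) * 0.02)
    q_projs = nn.ModuleList(
        [SharedLinearWithLoRA(shared_q, rank) for _ in range(n_layers)]
    )

    tokens = torch.randint(0, 32000, (2, 16))
    h = FactorizedEmbedding(32000, d_model)(tokens)
    for proj in q_projs:
        h = proj(h)
    print(h.shape)  # torch.Size([2, 16, 256])
```

Under this kind of parameterization, the memory cost of the layer stack is dominated by the single shared matrix plus the small per-layer adapters, while each forward pass still performs the full dense matmul per layer, which is the FLOPs-for-memory trade the abstract refers to.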
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Challenge: This submission is an entry to the science of DL improvement challenge.
Submission Number: 88