FOSL: A Foldable Sparse-and-Low-Rank Method for Efficient LLM Pre-training

18 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: efficient pre-training; low-rank adaptation; structured sparsity; large language models; model folding
TL;DR: FOSL trains LLMs efficiently by keeping full-width activations while decoupling compute from width via two paths—a compact low-rank path and a folded sparse path that reuses channels.
Abstract: We propose FOSL, a foldable, sparse-and-low-rank reparameterization for efficient pre-training that decouples compute from width. Each linear/FFN/attention projection is rewritten as two cooperating paths: a low-rank path that injects expressive features via a compact adapter, and a folded sparse path that computes only a subset of output channels and synthesizes the remainder as virtual channels by reusing computed ones. A lightweight, variance-preserving rescaling keeps activations stable when channels are reused multiple times. This design delivers the benefits of a narrower network internally while maintaining full-width activations at the interface, avoiding the representation bottlenecks of hard pruning and complementing low-rank-only approaches. We evaluate FOSL for LLM pre-training across model scales from 60M to 7B parameters. Our experiments demonstrate that FOSL matches or surpasses full-rank models, while substantially reducing memory and computation costs.
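The two-path reparameterization described in the abstract can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the authors' implementation: the function name `fosl_linear`, the single-index `channel_map` for synthesizing virtual channels, and the `1/sqrt(reuse count)` rescaling are all hypothetical choices standing in for the paper's unspecified folding and variance-preserving schemes. It only shows how a full-width output can be assembled from a rank-`r` adapter plus `k < d_out` computed channels.

```python
import numpy as np

def fosl_linear(x, A, B, W_s, channel_map, scale):
    """Toy FOSL-style forward pass (hypothetical construction).

    low-rank path : B @ (A @ x) injects rank-r features at full width.
    folded path   : W_s @ x computes only k real channels; the remaining
                    output channels are 'virtual' copies selected by
                    channel_map and rescaled for variance stability.
    """
    low_rank = B @ (A @ x)              # (d_out,) compact adapter path
    real = W_s @ x                      # (k,) only k channels computed
    folded = scale * real[channel_map]  # (d_out,) reuse channels at full width
    return low_rank + folded

rng = np.random.default_rng(0)
d_in, d_out, r, k = 64, 64, 8, 16       # k < d_out: compute decoupled from width
A = rng.standard_normal((r, d_in)) / np.sqrt(d_in)
B = rng.standard_normal((d_out, r)) / np.sqrt(r)
W_s = rng.standard_normal((k, d_in)) / np.sqrt(d_in)

# Each output channel reuses one of the k computed channels (assumed mapping).
channel_map = rng.integers(0, k, size=d_out)

# Assumed variance-preserving rescale: damp each copy by sqrt of reuse count,
# so heavily reused channels do not dominate the output statistics.
counts = np.bincount(channel_map, minlength=k)
scale = 1.0 / np.sqrt(np.maximum(counts[channel_map], 1))

x = rng.standard_normal(d_in)
y = fosl_linear(x, A, B, W_s, channel_map, scale)
print(y.shape)  # full-width activation despite computing only k + r·2 projections
```

The interface-level point is visible in the shapes: the dense matmul cost scales with `k + r` rather than `d_out`, yet `y` remains a full `d_out`-dimensional activation, matching the abstract's claim of a narrower network internally with full-width activations at the interface.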
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11777