ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity

04 May 2026 (modified: 13 May 2026) · Under review for TMLR · CC BY 4.0
Abstract: Large Language Models (LLMs) have achieved remarkable capabilities, but their immense computational demands during training remain a critical bottleneck for widespread adoption. Low-rank training has received attention in recent years due to its ability to significantly reduce training memory usage. Meanwhile, applying 2:4 structured sparsity to weights and activations to leverage NVIDIA GPU support for the 2:4 structured sparse format has become a promising direction. However, existing low-rank methods often leave activation matrices at full rank, and these dominate memory consumption and limit throughput during large-batch training. Furthermore, directly applying sparsity to weights often leads to non-negligible performance degradation. To achieve efficient pre-training of LLMs, this paper proposes ELAS: \textbf{E}fficient pre-training of \textbf{L}ow-rank LLMs via 2:4 \textbf{A}ctivation \textbf{S}parsity, a novel framework that combines low-rank models with 2:4 activation sparsity. ELAS applies squared ReLU activation functions to the feed-forward networks in low-rank models and imposes 2:4 structured sparsity on the activations after the squared ReLU operation. We evaluate ELAS through pre-training experiments on LLaMA models ranging from 60M to 1B parameters. The results demonstrate that ELAS maintains performance with minimal degradation after applying 2:4 activation sparsity, while achieving training and inference acceleration. Moreover, ELAS reduces activation memory overhead, particularly at large batch sizes. Code will be made available.
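To make the described mechanism concrete, the following is a minimal conceptual sketch, not the authors' implementation: a low-rank feed-forward block with a squared ReLU activation and 2:4 structured sparsity applied to the post-activation tensor. The class and function names (`LowRankSparseFFN`, `two_to_four_sparsify`), the rank, and the dimensions are illustrative assumptions; real speedups would come from NVIDIA's 2:4 sparse tensor-core kernels rather than the dense masking shown here.

```python
# Conceptual sketch only (assumed structure, not the ELAS codebase).
import torch
import torch.nn as nn


def two_to_four_sparsify(x: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude entries in every contiguous group of 4."""
    orig_shape = x.shape
    groups = x.reshape(-1, 4)                   # assumes the flattened size is divisible by 4
    idx = groups.abs().topk(2, dim=-1).indices  # positions of the 2 largest magnitudes
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return (groups * mask).reshape(orig_shape)


class LowRankSparseFFN(nn.Module):
    """Feed-forward block: low-rank projections + squared ReLU + 2:4 activation sparsity."""

    def __init__(self, d_model: int, d_ff: int, rank: int):
        super().__init__()
        # Low-rank factorization of the up projection: d_model -> rank -> d_ff
        self.up_a = nn.Linear(d_model, rank, bias=False)
        self.up_b = nn.Linear(rank, d_ff, bias=False)
        # Low-rank factorization of the down projection: d_ff -> rank -> d_model
        self.down_a = nn.Linear(d_ff, rank, bias=False)
        self.down_b = nn.Linear(rank, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.up_b(self.up_a(x))
        h = torch.relu(h) ** 2          # squared ReLU yields many exact zeros
        h = two_to_four_sparsify(h)     # enforce 2:4 structure on the activation
        return self.down_b(self.down_a(h))


if __name__ == "__main__":
    ffn = LowRankSparseFFN(d_model=512, d_ff=2048, rank=128)
    out = ffn(torch.randn(2, 16, 512))
    print(out.shape)  # torch.Size([2, 16, 512])
```

In this sketch the squared ReLU makes many activations exactly zero, so the subsequent 2:4 masking discards relatively little information; the abstract's claim of minimal degradation rests on that interplay.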
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Vimal_Thilak2
Submission Number: 8769