Keywords: 2:4 activation sparsity; Low-rank model; Large Language Models
TL;DR: This paper proposes a 2:4 activation sparsity framework for low-rank language models that accelerates training and inference while reducing memory consumption with minimal performance loss.
Abstract: Large Language Models (LLMs) have achieved remarkable capabilities, but their immense computational demands during training remain a critical bottleneck for widespread adoption. Low-rank training has received attention in recent years for its ability to significantly reduce training memory usage. Meanwhile, applying 2:4 structured sparsity to weights and activations, which leverages NVIDIA GPU support for the 2:4 structured sparse format, has become a promising direction. To achieve efficient pre-training of LLMs, this paper proposes ELAS: $\textbf{E}$fficient pre-training of $\textbf{L}$ow-rank LLMs via 2:4 $\textbf{A}$ctivation $\textbf{S}$parsity. ELAS applies squared ReLU activation functions to the feed-forward networks of low-rank models and imposes 2:4 structured sparsity on the activations produced by the squared ReLU operation. We evaluate ELAS through pre-training experiments on LLaMA models. The results demonstrate that ELAS maintains performance with minimal degradation after applying 2:4 activation sparsity, while accelerating both training and inference. Moreover, ELAS reduces activation memory overhead, particularly at large batch sizes. Code will be made available.
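To make the abstract's core mechanism concrete, below is a minimal illustrative sketch, not the authors' implementation: a hypothetical PyTorch module (`LowRankSparseFFN`, `sparsify_2_to_4` are names introduced here for illustration) that combines a low-rank feed-forward projection, a squared ReLU activation, and a magnitude-based 2:4 mask on the resulting activations. The actual ELAS kernels and training pipeline are assumed to differ.

```python
# Minimal sketch (not the authors' code), assuming a PyTorch-style low-rank FFN.
import torch
import torch.nn as nn


def sparsify_2_to_4(x: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude values in every contiguous group of 4 (hypothetical helper)."""
    orig_shape = x.shape
    groups = x.reshape(-1, 4)                      # last dim assumed divisible by 4
    idx = groups.abs().topk(2, dim=-1).indices     # indices of the 2 entries kept per group
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return (groups * mask).reshape(orig_shape)


class LowRankSparseFFN(nn.Module):
    """Hypothetical low-rank FFN with squared ReLU and 2:4 activation sparsity."""

    def __init__(self, d_model: int, d_ffn: int, rank: int):
        super().__init__()
        # Low-rank factorization of the up projection: d_model -> rank -> d_ffn
        self.up_a = nn.Linear(d_model, rank, bias=False)
        self.up_b = nn.Linear(rank, d_ffn, bias=False)
        self.down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.up_b(self.up_a(x))
        h = torch.relu(h) ** 2                     # squared ReLU activation
        h = sparsify_2_to_4(h)                     # 2:4 structured sparsity on activations
        return self.down(h)                        # down projection over the sparsified activations
```

In practice, the speedups described in the abstract would come from feeding the 2:4 pattern to NVIDIA sparse tensor core kernels rather than performing a dense masked matmul as in this sketch, which only illustrates the sparsification pattern itself.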
Primary Area: generative models
Submission Number: 14387