Sparse and Wide Linear RNNs Are at the Efficiency-Performance Pareto Front

Published: 05 Mar 2025, Last Modified: 16 Apr 2025 · SLLM · CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: Unstructured Sparsity, Pruning, ReLU, Quantization, Neuromorphic Hardware, RNNs, SSMs
Abstract: Linear recurrent neural networks enable powerful long-range sequence modeling with constant memory usage and time per token during inference. These architectures hold promise for streaming applications at the edge, but deployment in resource-constrained environments requires hardware-aware optimizations to minimize latency and energy consumption. In this paper, we investigate how effectively unstructured sparsity, in both weights and activations, reduces the computational demand of linear RNNs, and how it combines with quantization. We find that highly sparse linear RNNs consistently achieve better efficiency-performance trade-offs than dense baselines, with $2\times$ less compute and $36\%$ less memory at iso-accuracy, and that quantizing a sparse-and-wide network degrades performance less than quantizing a dense one. When quantized to fixed-point arithmetic and deployed on the Intel Loihi 2 neuromorphic chip, sparse models demonstrate $42\times$ lower latency and $149\times$ lower energy consumption than an iso-accuracy dense model on an edge GPU, providing hardware validation of the theoretical gains of unstructured sparsity.
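
To make the setting concrete, below is a minimal, illustrative sketch of a linear RNN step with unstructured magnitude-pruned weights and a ReLU readout that induces activation sparsity. The names (`SparseLinearRNN`, `magnitude_prune`), shapes, sparsity level, and one-shot pruning scheme are assumptions for illustration only, not the paper's actual architecture, training recipe, or quantization pipeline.

```python
import numpy as np


def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude entries of w (unstructured pruning)."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > thresh, w, 0.0)


class SparseLinearRNN:
    """Toy linear RNN cell: h_t = A h_{t-1} + B x_t,  y_t = relu(C h_t).

    Unstructured sparsity is applied to A, B, C via one-shot magnitude
    pruning; the ReLU readout additionally produces sparse activations.
    Not the paper's model -- a sketch of the kind of computation involved.
    """

    def __init__(self, d_in, d_hidden, d_out, sparsity=0.9, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_hidden)
        self.A = magnitude_prune(rng.normal(0.0, scale, (d_hidden, d_hidden)), sparsity)
        self.B = magnitude_prune(rng.normal(0.0, scale, (d_hidden, d_in)), sparsity)
        self.C = magnitude_prune(rng.normal(0.0, scale, (d_out, d_hidden)), sparsity)

    def step(self, h, x):
        # Constant memory and time per token: the state h is the only carry-over.
        h = self.A @ h + self.B @ x
        y = np.maximum(self.C @ h, 0.0)  # ReLU -> sparse activations
        return h, y


# Stream a sequence token by token.
rnn = SparseLinearRNN(d_in=16, d_hidden=128, d_out=16, sparsity=0.9)
h = np.zeros(128)
for _ in range(32):
    h, y = rnn.step(h, np.random.randn(16))

print("weight sparsity (A):", 1.0 - np.count_nonzero(rnn.A) / rnn.A.size)
print("activation sparsity (y):", 1.0 - np.count_nonzero(y) / y.size)
```

On hardware that can exploit unstructured sparsity, such as Loihi 2, the zero weights and zero activations translate into skipped operations and memory traffic; on dense hardware they largely do not, which is the efficiency gap the abstract's measurements quantify.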
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 84