Track: long paper (up to 4 pages)
Keywords: Unstructured Sparsity, Pruning, ReLU, Quantization, Neuromorphic Hardware, RNNs, SSMs
Abstract: Linear recurrent neural networks enable powerful long-range sequence modeling with constant memory usage and constant time per token during inference.
These architectures hold promise for streaming applications at the edge, but deployment in resource-constrained environments requires hardware-aware optimizations to minimize latency and energy consumption.
In this paper, we investigate the effectiveness of unstructured sparsity, in both weights and activations, in reducing the computational demand of linear RNNs, as well as its combination with quantization.
We find that highly sparse linear RNNs consistently achieve better efficiency-performance trade-offs than dense baselines, requiring $2\times$ less compute and $36\%$ less memory at iso-accuracy, and that quantizing a sparse-and-wide network incurs lower performance degradation.
When quantized to fixed-point arithmetic and deployed on the Intel Loihi 2 neuromorphic chip, sparse models demonstrate $42\times$ lower latency and $149\times$ lower energy consumption than an iso-accuracy dense model on an edge GPU, providing hardware validation of the theoretical gains of unstructured sparsity.
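For illustration, a minimal sketch of the kind of pipeline the abstract describes, assuming a diagonal linear recurrence with a ReLU readout, magnitude pruning for unstructured weight sparsity, and simple fixed-point rounding for quantization; all function names, dimensions, sparsity levels, and bit widths below are hypothetical choices for the sketch, not the paper's implementation.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude entries of w until the target sparsity is reached."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > thresh, w, 0.0)

def quantize_fixed_point(x, frac_bits=8):
    """Round to a signed fixed-point grid with `frac_bits` fractional bits."""
    scale = 2.0 ** frac_bits
    return np.round(x * scale) / scale

# Hypothetical diagonal linear RNN layer: h_t = a * h_{t-1} + B x_t, y_t = relu(C h_t)
rng = np.random.default_rng(0)
d_in, d_hidden = 16, 64
a = quantize_fixed_point(rng.uniform(0.5, 0.99, d_hidden))                     # diagonal recurrent decay
B = quantize_fixed_point(magnitude_prune(rng.normal(0, 0.1, (d_hidden, d_in)), 0.9))
C = quantize_fixed_point(magnitude_prune(rng.normal(0, 0.1, (d_in, d_hidden)), 0.9))

h = np.zeros(d_hidden)
for t in range(32):                                                            # constant memory per token
    x_t = rng.normal(size=d_in)
    h = quantize_fixed_point(a * h + B @ x_t)
    y_t = np.maximum(C @ h, 0.0)                                               # ReLU also sparsifies activations
print("weight sparsity of B:", np.mean(B == 0), "activation sparsity of y:", np.mean(y_t == 0))
```

In this sketch, unstructured sparsity shows up in two places: pruned entries of B and C can be skipped entirely at inference time, and the ReLU readout produces zero activations that further reduce the work per token.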
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 84