Track: long paper (up to 4 pages)
Keywords: Unstructured Sparsity, Pruning, ReLU, Quantization, Neuromorphic Hardware, RNNs, SSMs
Abstract: Linear recurrent neural networks enable powerful long-range sequence modeling with constant memory usage and constant time per token during inference.
These architectures hold promise for streaming applications at the edge, but deployment in resource-constrained environments requires hardware-aware optimizations to minimize latency and energy consumption.
In this paper, we investigate the effectiveness of unstructured sparsity, in both weights and activations, in reducing the computational demand of linear RNNs, as well as its combination with quantization.
We find that highly sparse linear RNNs consistently achieve better efficiency-performance trade-offs than dense baselines, requiring $2\times$ less compute and $36\%$ less memory at iso-accuracy, and that quantizing a sparse-and-wide network incurs lower performance degradation.
When quantized to fixed-point arithmetic and deployed on the Intel Loihi 2 neuromorphic chip, sparse models demonstrate $42\times$ lower latency and $149\times$ lower energy consumption than an iso-accuracy dense model on an edge GPU, providing hardware validation of the theoretical gains of unstructured sparsity.
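For illustration, a minimal sketch of the kind of pipeline the abstract describes, assuming a diagonal linear recurrence with a ReLU readout, magnitude pruning for unstructured weight sparsity, and simple fixed-point rounding for quantization; all function names, dimensions, sparsity levels, and bit widths below are hypothetical choices for the sketch, not the paper's implementation.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude entries of w until the target sparsity is reached."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > thresh, w, 0.0)

def quantize_fixed_point(x, frac_bits=8):
    """Round to a signed fixed-point grid with `frac_bits` fractional bits."""
    scale = 2.0 ** frac_bits
    return np.round(x * scale) / scale

# Hypothetical diagonal linear RNN layer: h_t = a * h_{t-1} + B x_t, y_t = relu(C h_t)
rng = np.random.default_rng(0)
d_in, d_hidden = 16, 64
a = quantize_fixed_point(rng.uniform(0.5, 0.99, d_hidden))                     # diagonal recurrent decay
B = quantize_fixed_point(magnitude_prune(rng.normal(0, 0.1, (d_hidden, d_in)), 0.9))
C = quantize_fixed_point(magnitude_prune(rng.normal(0, 0.1, (d_in, d_hidden)), 0.9))

h = np.zeros(d_hidden)
for t in range(32):                                                            # constant memory per token
    x_t = rng.normal(size=d_in)
    h = quantize_fixed_point(a * h + B @ x_t)
    y_t = np.maximum(C @ h, 0.0)                                               # ReLU also sparsifies activations
print("weight sparsity of B:", np.mean(B == 0), "activation sparsity of y:", np.mean(y_t == 0))
```

In this sketch, unstructured sparsity shows up in two places: pruned entries of B and C can be skipped entirely at inference time, and the ReLU readout produces zero activations that further reduce the work per token.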
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 84