Abstract: Spiking Neural Networks (SNNs) have gained significant research attention over the past decade due to their potential for enabling efficient inference on resource-constrained edge devices. While existing SNN accelerators efficiently process sparse spikes with dense weights, the opportunities for accelerating SNNs with sparse weights, referred to as dual-sparsity, remain underexplored. In this work, we focus on accelerating dual-sparse SNNs, particularly on their core operation: sparse-matrix-sparse-matrix multiplication (spMspM). Our observations reveal that executing a dual-sparse SNN on existing spMspM accelerators designed for dual-sparse Artificial Neural Networks (ANNs) results in sub-optimal efficiency. The main challenge is that SNNs, which naturally process multiple timesteps, introduce an additional loop into ANN spMspM, leading to longer latency and more memory traffic. To address this issue, we propose a fully temporal-parallel (FTP) dataflow that minimizes data movement across timesteps and reduces the end-to-end latency of dual-sparse SNNs. To enhance the efficiency of the FTP dataflow, we introduce an FTP-friendly spike compression mechanism that efficiently compresses single-bit spikes and ensures contiguous memory access. Additionally, we propose an FTP-friendly inner-join circuit that reduces the cost of expensive prefix-sum circuits with a negligible throughput penalty. These innovations are encapsulated in LoAS, a Low-latency inference Accelerator for dual-sparse SNNs. Running dual-sparse SNN workloads on LoAS demonstrates significant speedup (up to $8.51\times$) and energy reduction (up to $3.68\times$) compared to prior dual-sparse accelerators.
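To make the extra timestep loop concrete, the sketch below contrasts an ANN-style mapping (one spMspM per timestep, re-traversing the sparse weights $T$ times) with the single-bit packing idea behind a temporally parallel dataflow. All shapes, densities, and variable names here are illustrative assumptions for exposition; this is not the paper's implementation of LoAS or its compression format.

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

# Illustrative sizes (assumptions, not from the paper):
# T timesteps, M spike rows, K input features, N output features.
T, M, K, N = 4, 32, 64, 32
rng = np.random.default_rng(0)

# Sparse weights (shared across timesteps) and one sparse binary
# spike matrix per timestep -- the "dual-sparse" setting.
weights = sparse_random(K, N, density=0.2, format="csr", random_state=rng)
spikes = [csr_matrix((rng.random((M, K)) < 0.1).astype(np.float64))
          for _ in range(T)]

# ANN-style mapping: one spMspM per timestep. The sparse weight matrix
# is traversed T times, which is the latency/memory-traffic overhead
# that motivates the FTP dataflow.
psums = [spikes[t] @ weights for t in range(T)]

# FTP-flavoured idea (sketch): since spikes are single-bit, the T
# timesteps of each input position can be packed into one T-bit word,
# so a single pass over the sparse weights can serve all timesteps.
packed = sum(spikes[t].astype(np.int64) * (1 << t) for t in range(T))
```

The `packed` matrix only gestures at the single-bit compression idea; in LoAS itself, the FTP-friendly compression format and inner-join circuit realize this in hardware so that weight data is fetched once and reused across all timesteps.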