Abstract: Spiking Neural Networks (SNNs) are amenable to deployment on edge devices and neuromorphic hardware due to their low energy dissipation. Recently, SNN-based transformers have
garnered significant interest, incorporating attention mechanisms akin to their counterparts
in Artificial Neural Networks (ANNs) while demonstrating excellent performance. However,
deploying large spiking transformer models on resource-constrained edge devices such as
mobile phones, still poses significant challenges resulting from the high computational demands of large, uncompressed, high-precision models. In this work, we introduce a novel
heterogeneous quantization method for compressing spiking transformers through layer-wise
quantization. Our approach optimizes the quantization of each layer using one of two distinct
quantization schemes, i.e., uniform or power-of-two quantization, with mixed bit resolutions.
Our heterogeneous quantization demonstrates the feasibility of maintaining high performance
for spiking transformers while utilizing an average effective resolution of 3.14-3.67 bits with
less than a 1% accuracy drop on DVS Gesture and CIFAR10-DVS datasets. It attains a
model compression rate of 8.71×-10.19× for standard floating-point spiking transformers.
Moreover, the proposed approach achieves a significant energy reduction of 5.69×, 8.72×,
and 10.2× while maintaining high accuracy levels of 85.3%, 97.57%, and 80.4% on the
N-Caltech101, DVS-Gesture, and CIFAR10-DVS datasets, respectively.
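To make the two per-layer quantization schemes concrete, the sketch below illustrates what uniform and power-of-two weight quantization each look like at a given bit resolution. This is an illustrative assumption about the general form of such quantizers, not the paper's exact implementation; the function names and the range/exponent handling are hypothetical.

```python
import numpy as np

def uniform_quantize(w, bits):
    """Uniform quantization: map weights onto 2^bits - 1 evenly spaced
    steps spanning the tensor's value range (a common affine scheme)."""
    levels = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / levels
    q = np.round((w - w_min) / scale)
    return q * scale + w_min

def pow2_quantize(w, bits):
    """Power-of-two quantization: snap each weight's magnitude to the
    nearest power of two (sign preserved), so multiplications reduce to
    bit shifts in hardware. Exponent range here is an assumed choice."""
    sign = np.sign(w)
    mag = np.abs(w)
    e_max = np.floor(np.log2(mag.max()))
    # Keep 2^bits - 1 exponent values below e_max; zeros stay zero.
    e_min = e_max - (2 ** bits - 2)
    with np.errstate(divide="ignore"):
        e = np.clip(np.round(np.log2(mag)), e_min, e_max)
    q = sign * np.exp2(e)
    q[mag == 0] = 0.0
    return q
```

A layer-wise heterogeneous scheme, as described above, would pick one of these two quantizers and a bit width per layer, trading accuracy against the average effective resolution of the whole model.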