Abstract: Low-precision quantization and sparsity have been widely explored for CNN acceleration because they reduce computational complexity and memory requirements. However, to support variable numerical precision and sparse computation, prior accelerators design flexible multipliers and sparse dataflows separately; a unified solution that simultaneously exploits mixed precision and dual-sided irregular sparsity for CNN acceleration is still lacking. Through an in-depth review of existing precision-scalable and sparse accelerators, we observe that directly combining low-level flexible multipliers with high-level sparse dataflows is challenging because their design spaces are orthogonal. To this end, in this paper, we propose condensed streaming computation. By representing non-zero weights and activations as atomized streams, low-level mixed-precision multiplication and high-level sparse convolution can be unified into a shared dataflow through hierarchical data reuse. Building on condensed streaming computation, we propose Ristretto, an atomized architecture that exploits both mixed precision and dual-sided irregular sparsity for CNN inference. We implement Ristretto in a 28 nm technology node. Extensive evaluations show that Ristretto consistently outperforms three state-of-the-art CNN accelerators, Bit Fusion, Laconic, and SparTen, in both performance and energy efficiency.
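To make the idea of atomized streams concrete, the sketch below is a minimal, purely illustrative Python model (not the paper's implementation, and the function names `atomize` and `stream_dot` are hypothetical): each non-zero operand is decomposed into its set-bit "atoms", zero operands are skipped on both sides, and every atom pair contributes a single shift-and-add, so narrower precision naturally emits fewer atoms.

```python
# Conceptual sketch of condensed streaming computation, assuming unsigned
# integer operands. Not the paper's dataflow; for illustration only.

def atomize(value):
    """Decompose a non-negative integer into its set bit positions (atoms).
    A zero value yields no atoms, so zeros contribute no work."""
    atoms, bit = [], 0
    while value:
        if value & 1:
            atoms.append(bit)
        value >>= 1
        bit += 1
    return atoms

def stream_dot(weights, activations):
    """Dot product over non-zero weight/activation pairs only, with each
    multiplication expanded into shift-and-adds over the operands' atoms."""
    acc = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:           # dual-sided sparsity: skip zero operands
            continue
        for wb in atomize(w):          # atoms of the weight
            for ab in atomize(a):      # atoms of the activation
                acc += 1 << (wb + ab)  # each atom pair is one shift-and-add
    return acc

# Matches the dense dot product 3*0 + 5*2 + 0*7 + 6*4 = 34
assert stream_dot([3, 5, 0, 6], [0, 2, 7, 4]) == 34
```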