Faster Ternary and Binary Neural Network Inference on CPU by Reducing Popcount Overhead

Published: 2025 · Last Modified: 28 Jan 2026 · ISLPED 2025 · CC BY-SA 4.0
Abstract: Quantization is a widely adopted method of reducing the resource consumption of neural network models while maintaining good model accuracy. Ternary and Binary Neural Networks (TNNs and BNNs) can be implemented with lightweight bitwise operations and are thus very suitable for edge platforms. Existing efforts mainly optimize the bitwise computation algorithms for BNN inference. However, TNNs and mixed-precision Ternary-Binary Neural Networks (TBNs and BTNs) still lack optimized computing libraries on AVX2 and ARM CPUs. Their data preparation walks through the data multiple times, resulting in low data locality. Moreover, popcount accounts for up to 28% of the total operations in bitwise matrix multiplication, yet the popcount instruction has a throughput of only one per cycle and no SIMD equivalent in AVX2, making it the central performance bottleneck. In this paper, we propose a faster inference method for TNNs, TBNs, and BTNs on AVX2 and ARM CPUs. First, we optimize the data preparation stage by fusing quantization, bit-packing, and image-to-row into one loop to improve data locality. Second, we propose an efficient bitwise matrix multiplication algorithm for AVX2 that replaces the low-throughput popcount instructions with high-throughput SIMD instructions and applies a new data encoding. This algorithm reduces the total instruction count by 15% and brings a 2.2× theoretical speedup. Third, we implement a fast C++ inference library for TNNs, TBNs, and BTNs with standard optimizations such as blocking and loop unrolling. Benchmarking results show that our new matrix multiplication algorithm is up to 2.1× faster than the related work TAB on AVX2 CPUs. We further achieve layer-level speedups of up to 2.7× on AVX2 and 2.3× on ARM over the baseline for TNNs, TBNs, and BTNs. Moreover, we achieve 1.3-1.9× end-to-end speedup and 1.2-1.8× better energy efficiency compared to TAB on ResNet, Darknet, and VGG models.