Abstract: As neural networks grow deeper and feature maps grow larger, limited communication bandwidth with external memory (DRAM) and power constraints become a bottleneck for network inference on mobile and edge devices. In this paper, we propose an end-to-end differentiable, bandwidth-efficient neural inference method in which activations are compressed with a neural data compression method. Specifically, we propose a transform-quantization-entropy coding pipeline for activation compression, with symmetric exponential Golomb coding and a data-dependent Gaussian entropy model for arithmetic coding. Optimized together with existing model quantization methods, the low-level task of image compression achieves up to 19× bandwidth reduction with 6.21× energy saving. The code implementation is available at https://github.com/xyzysz/Bandwidth_efficient_nic.
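As an illustration of the two coding ingredients named in the abstract, the following is a minimal sketch and not the authors' implementation. The abstract does not define the "symmetric" exponential Golomb variant, so the sketch assumes the standard signed order-0 exp-Golomb mapping known from video coding as a stand-in. It also assumes a Gaussian entropy model whose `mean` and `scale` are given per value, and estimates the bit cost of a quantized activation as the negative log-probability of its unit-width bin. All function names are hypothetical.

```python
import math


def signed_to_unsigned(v: int) -> int:
    """Interleave signed integers: 0, 1, -1, 2, -2 ... -> 0, 1, 2, 3, 4 ..."""
    return 2 * v - 1 if v > 0 else -2 * v


def exp_golomb_encode(n: int) -> str:
    """Order-0 exp-Golomb code of a non-negative integer, as a bit string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    binary = bin(n + 1)[2:]  # binary digits of n + 1
    return "0" * (len(binary) - 1) + binary  # zero prefix encodes the length


def encode_signed(v: int) -> str:
    """Assumed stand-in for the paper's symmetric exp-Golomb code."""
    return exp_golomb_encode(signed_to_unsigned(v))


def decode_signed(bits: str) -> int:
    """Decode a single codeword produced by encode_signed."""
    zeros = len(bits) - len(bits.lstrip("0"))
    n = int(bits[zeros:2 * zeros + 1], 2) - 1
    return (n + 1) // 2 if n % 2 == 1 else -(n // 2)


def gaussian_cdf(x: float, mean: float, scale: float) -> float:
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def estimated_bits(q: int, mean: float, scale: float) -> float:
    """Bit cost of integer q under a Gaussian integrated over [q-0.5, q+0.5]."""
    p = gaussian_cdf(q + 0.5, mean, scale) - gaussian_cdf(q - 0.5, mean, scale)
    return -math.log2(max(p, 1e-9))  # clamp to avoid log(0)


if __name__ == "__main__":
    activations = [0.2, -1.4, 2.6, 0.0]     # made-up example values
    quantized = [round(a) for a in activations]
    for q in quantized:
        print(q, encode_signed(q), f"{estimated_bits(q, 0.0, 1.0):.2f} bits")
```

Values near zero receive the shortest codewords in both schemes, which is why such codes suit activations that cluster around zero after a transform and quantization step.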
External IDs: dblp:conf/icassp/YinXLWL0L24