Abstract: Gradient compression has attracted much attention in recent years due to its effectiveness in alleviating the communication bottleneck of distributed deep learning. However, in addition to reducing communication, it introduces significant computational overhead, which limits or even eliminates the benefit of the communication reduction. To address this problem, we propose an FPGA-based accelerator for gradient compression. We develop a high-performance, programmable accelerator architecture that speeds up various gradient compression algorithms by offloading their compute-intensive operations to FPGA, and we design and implement the accelerator for the popular top-k sparsification algorithm. Experimental results show that the accelerator runs up to hundreds of times faster than CPU and GPU implementations of the compression algorithm. Moreover, its stable and controllable performance across different datasets demonstrates that the proposed accelerator is insensitive to data distribution, which is essential for time-sensitive applications.
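For reference, top-k sparsification keeps only the k largest-magnitude gradient entries and transmits them as (value, index) pairs. Below is a minimal NumPy sketch of this operation; the function name, the `ratio` parameter, and the implementation are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def topk_sparsify(grad: np.ndarray, ratio: float = 0.01):
    """Keep only the k largest-magnitude gradient entries.

    Returns the kept values and their flat indices; only these
    (value, index) pairs need to be communicated.
    """
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    # Selecting the k largest-magnitude entries is the compute-intensive
    # step that the paper proposes to offload to FPGA.
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return flat[idx], idx

# Example: compress a gradient tensor to ~1% of its entries.
g = np.random.randn(1_000_000).astype(np.float32)
values, indices = topk_sparsify(g, ratio=0.01)
```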