Abstract: Deep learning (DL) frameworks rely on various parallel computing libraries to optimize the speed of model training. The underlying computations of DL operators typically include essential functions such as reduction and prefix scan, whose efficiency can be greatly improved by parallel acceleration devices. However, the acceleration of these computations is mainly supported by collective primitive libraries such as NVIDIA CUB and AMD hipCUB, which are available only on vendor-specific hardware accelerators due to the highly segregated computing ecosystems of different vendors. To address this issue, we propose oclCUB, an OpenCL parallel computing library that can run on different heterogeneous platforms. OclCUB abstracts the OpenCL execution environment, implements reusable common underlying computations of DL, and provides two types of interfaces targeting the heterogeneous acceleration patterns of operators, enabling users to design and optimize DL operators efficiently. We evaluate oclCUB on various hardware accelerators: NVIDIA Tesla V100s with OpenCL 1.2, AMD Radeon Pro V520 with OpenCL 2.0, MT-3000 with MOCL 3, and Kunpeng 920 with POCL 1.6. Our experiments show that oclCUB-based operators produce accurate computational results on all platforms. The results also demonstrate that oclCUB maintains a small, acceptable performance gap with CUB and achieves performance comparable to hipCUB.