Abstract: Graphics Processing Units (GPUs) schedule kernels at the granularity of a warp, a group of threads that must be scheduled together. When a kernel contains conditional branches, the threads within a warp may take different branches, which are then executed sequentially, resulting in considerable utilization loss and unpredictable execution time. This problem is known as control flow divergence. In this work, we propose a novel method that predicts each thread's execution path before the kernel is launched by deploying a branch prediction network on the GPU's tensor cores. The network runs efficiently in parallel with kernels executing on the CUDA cores, so divergence can be reduced to a large extent with minimal overhead. Combined with a well-designed thread data reorganization algorithm, this solution better mitigates the control flow divergence problem on GPUs.
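To make the idea concrete, the following is a minimal host-side sketch in Python, not the algorithm proposed in this work. It assumes a warp size of 32 threads, assumes the branch predictions are already available and perfectly accurate, and uses a plain stable sort as a stand-in for the thread data reorganization step. The names `divergent_warps` and `reorder_by_prediction` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

WARP_SIZE = 32  # assumed warp size (threads scheduled together)


def divergent_warps(branches: Sequence[int], warp_size: int = WARP_SIZE) -> int:
    """Count warps whose threads do not all take the same branch."""
    count = 0
    for start in range(0, len(branches), warp_size):
        warp = branches[start:start + warp_size]
        if len(set(warp)) > 1:
            count += 1
    return count


def reorder_by_prediction(predicted: Sequence[int]) -> List[int]:
    """Return a permutation of thread indices that groups equal predictions.

    This is an illustrative stand-in (a stable sort), not the paper's
    reorganization algorithm. Threads keep their relative order in a group.
    """
    return sorted(range(len(predicted)), key=lambda i: predicted[i])


# Hypothetical workload: 128 threads whose branch alternates with the index,
# so every warp contains both branch outcomes.
predicted = [i % 2 for i in range(128)]

order = reorder_by_prediction(predicted)
reordered = [predicted[i] for i in order]

before = divergent_warps(predicted)
after = divergent_warps(reordered)
print(f"divergent warps before: {before}, after: {after}")
```

In this toy workload every warp diverges before regrouping and none diverges afterwards. With imperfect predictions, some warps would still diverge, which is why prediction accuracy and reorganization cost both matter in practice.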