Control Flow Divergence Optimization by Exploiting Tensor Cores

Weiguang Pang, Xu Jiang, Songran Liu, Lei Qiao, Kexue Fu, Longxiang Gao, Wang Yi

Published: 2024, Last Modified: 01 Aug 2025. DAC 2024. CC BY-SA 4.0

Abstract: Kernels are scheduled on Graphics Processing Units (GPUs) at the granularity of a warp, a group of threads that must be scheduled together. When a kernel contains conditional branches, the threads within a warp may take different branches, which are then executed sequentially, resulting in considerable utilization loss and unpredictable execution time. This problem is known as control flow divergence. In this work, we propose a novel method that predicts each thread's execution path before the kernel is launched by deploying a branch prediction network on the GPU's tensor cores, which can run efficiently in parallel with the kernels on CUDA cores, so that the divergence problem can be eased to a large extent with minimal overhead. Combined with a well-designed thread data reorganization algorithm, this solution can better mitigate the control flow divergence problem on GPUs.
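
To make the divergence problem and the idea of data reorganization concrete, the following is a minimal host-side sketch in Python. It is not the authors' algorithm: the helper names (`divergent_warps`, `reorder_by_prediction`), the even/odd workload, and the assumption of a perfect predictor are all hypothetical and chosen only for illustration. The sketch counts how many warps contain threads on both sides of a branch, then groups the input data by predicted branch outcome so that each warp sees a uniform path.

```python
WARP_SIZE = 32  # warp width on current NVIDIA GPUs


def divergent_warps(branch_taken, warp_size=WARP_SIZE):
    """Count warps whose threads do not all take the same branch."""
    count = 0
    for start in range(0, len(branch_taken), warp_size):
        warp = branch_taken[start:start + warp_size]
        if len(set(warp)) > 1:  # both True and False appear in this warp
            count += 1
    return count


def reorder_by_prediction(data, predicted):
    """Stable partition: elements predicted to take the branch come first.

    Returns the reordered data and the permutation used, so results can be
    scattered back to their original positions afterwards.
    """
    order = sorted(range(len(data)), key=lambda i: not predicted[i])
    return [data[i] for i in order], order


# Hypothetical workload: one element per thread, branching on even vs. odd.
data = list(range(128))
taken = [x % 2 == 0 for x in data]

# Assumption: the predictor is perfect, so predictions equal real outcomes.
before = divergent_warps(taken)
reordered, order = reorder_by_prediction(data, taken)
after = divergent_warps([x % 2 == 0 for x in reordered])

print(f"divergent warps before: {before}, after: {after}")
```

In this toy setting every warp diverges before reordering and none does afterwards. With an imperfect predictor, mispredicted elements would still land in the wrong group and leave some warps divergent, which is why prediction accuracy and prediction overhead both matter.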