Breaking the Vendor Lock: Performance Portable Programming through OpenMP as Target Independent Runtime Layer
Abstract: High performance computing (HPC) systems pervasively feature GPU accelerators. For maximum efficiency, these are usually programmed in vendor-specific languages such as CUDA. However, such code is not portable and leads to vendor lock-in. Existing portable programming models require transcribing the whole application, which is tedious and often results in sub-optimal performance without necessarily avoiding the need to maintain multiple versions. Although solutions for automated translation exist, they sacrifice features of the original model, performance, or both. We propose a novel compiler-based approach for performance portable programming of GPUs that generates portable code from the original, vendor-specific application source. Specifically, we present LLVM/Clang extensions for performance portable CUDA that leverage the existing LLVM/OpenMP offloading infrastructure for portable execution on different GPU architectures. Our contributions include redesigning the compiler driver for portable toolchain generation, defining a target-independent math library, and re-architecting compiler lowering from CUDA APIs to existing and new OpenMP runtime calls. We evaluate our approach using six established CUDA proxy and benchmark applications, first on NVIDIA GPUs to measure the overhead of our portability layer, and then on AMD GPUs to determine the efficacy of our approach. In both experiments we compare the performance to native program versions, i.e., CUDA and HIP. Our approach has minimal overhead compared to non-portable alternatives, thus providing viable performance portability for existing code without cost to the user. We further show that CUDA code can be debugged directly on the host.