Keywords: nvidia, gpu, compiler, mlir
TL;DR: A Python-embedded DSL simplifies GPU tensor core programming without sacrificing full control
Abstract: Exploiting the formidable computational capabilities of modern GPU tensor cores remains a challenging endeavor for developers. Existing programming models like CUDA and OpenCL are ill-suited for the non-SIMT nature of tensor cores, leaving a significant gap in the landscape of GPU programming languages. Vendors have primarily relied on library-based solutions or enhancements to mainstream machine learning frameworks, sacrificing the fine-grained control once afforded by CUDA in the SIMT era.
In this paper, we introduce NVDSL, a Python-embedded domain-specific language built on the MLIR compiler infrastructure. NVDSL abstracts away the intricate details of tensor core programming. It lets programmers target Hopper's warpgroup (128 threads, i.e., 4 warps) efficiently and express sophisticated algorithms, such as multistage pipelining and warp specialization, concisely. We demonstrate its efficacy through two optimized GEMM kernels that achieve cuBLAS-like performance while keeping the code clear. NVDSL is publicly available in upstream MLIR. The work was presented at EuroLLVM 2024 (https://www.youtube.com/watch?v=V3Q9IjsgXvA).
Submission Number: 74