NVDSL: Simplifying Tensor Cores with Python-Driven MLIR Metaprogramming

Published: 21 Jun 2024 · Last Modified: 26 Jul 2024 · ES-FoMo-II 2024 Poster · CC BY 4.0
Keywords: nvidia, gpu, compiler, mlir
TL;DR: A Python-embedded DSL simplifies GPU tensor core programming without sacrificing full control.
Abstract: Exploiting the formidable computational capabilities of modern GPU tensor cores remains a challenging endeavor for developers. Existing programming models such as CUDA and OpenCL are ill-suited to the non-SIMT nature of tensor cores, leaving a significant gap in the landscape of GPU programming languages. Vendors have primarily relied on library-based solutions or on enhancements to mainstream machine learning frameworks, sacrificing the fine-grained control once afforded by CUDA in the SIMT era. In this paper, we introduce NVDSL, a Python-embedded domain-specific language built on the MLIR compiler infrastructure. NVDSL abstracts away the intricate details of tensor core programming: it lets programmers efficiently target Hopper's warpgroups (128 threads, i.e., 4 warps), enabling users to express sophisticated algorithms, such as multistage pipelining and warp specialization, with remarkable simplicity. We demonstrate its efficacy through two optimized GEMM kernels that achieve cuBLAS-like performance with remarkable code clarity. NVDSL is publicly available in upstream MLIR. The work was presented at EuroLLVM 2024: https://www.youtube.com/watch?v=V3Q9IjsgXvA.
Submission Number: 74