Improving CUDA performance of an unstructured high-order CFD application under OP2 framework

Published: 01 Jan 2024, Last Modified: 17 Apr 2025. J. Supercomput. 2024. License: CC BY-SA 4.0
Abstract: OP2 is a domain-specific language-based programming framework for unstructured mesh applications. It supports automatic code generation targeting multiple parallel modes, including CUDA. However, using OP2 to generate efficient CUDA code for real-world applications is challenging. This paper reports our efforts to optimize CUDA code performance while refactoring an unstructured high-order CFD application, HOUR2D, on top of OP2. We implement a series of novel methods, including selecting appropriate execution strategies, using local arrays, and optimizing the OP2 data transfer function. Performance evaluation shows that these optimizations significantly improve the performance of the generated CUDA code: the overall performance of the optimized OP2-CUDA code is 13.2 times higher than that of the unoptimized OP2-CUDA code and 2.4 times higher than that of the manually written CUDA code. Moreover, these optimizations do not affect the portability of HOUR2D as an OP2 application.