Research on GPU transplantation optimization of PRM scalar advection scheme in GRAPES global forecast system

Zhangjie Tan, Jinfang Jia, Zhengsheng Ning, Jianqiang Huang, Xiaoying Wang

Published: 2025, Last Modified: 05 Jan 2026. CCF Trans. High Perform. Comput. 2025. License: CC BY-SA 4.0

Abstract: With the rise of large AI models, Graphics Processing Units (GPUs) have become the preferred hardware for many scientific applications because of their superior floating-point throughput. This paper explores the use of CPU+GPU heterogeneous acceleration in the Global/Regional Assimilation and Prediction System (GRAPES). We moved the most time-consuming part of the system's scalar advection scheme (PRM) to the GPU. Specifically, we carried out a detailed performance analysis of the PRM module and then refactored and ported the code to the GPU using C and CUDA C. During this process, we applied a series of optimizations, including changing the array storage order, optimizing GPU memory access, and merging loops to increase the amount of computation per kernel function. Additionally, to reduce communication overhead, we designed a communication-avoidance scheme. The final solution kept results within acceptable error margins and showed excellent scalability. On a cluster with Intel(R) Xeon(R) Gold 6326 CPUs and NVIDIA A800 GPUs, using 16 CPU cores and 8 GPU accelerators, we achieved a speedup of up to 87.90 times for the hotspot function and 5.21 times for the scalar advection scheme overall.
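As a rough illustration of two of the optimizations named in the abstract (changing the array storage order and merging loops), a minimal host-side C sketch follows. It is not the authors' code: the function names, the two layouts, and the arithmetic are hypothetical and were chosen only to show the general idea. The sketch assumes a 3D field that is stored with the vertical index k varying fastest and is reordered so that the horizontal index i varies fastest. On a GPU, the second layout lets consecutive threads that handle consecutive i values read adjacent memory addresses.

```c
#include <assert.h>
#include <stddef.h>

/* Flat index with the vertical level k varying fastest (hypothetical source layout). */
static size_t idx_k_fastest(size_t i, size_t j, size_t k, size_t nx, size_t nz)
{
    return (j * nx + i) * nz + k;
}

/* Flat index with the horizontal index i varying fastest (hypothetical target layout). */
static size_t idx_i_fastest(size_t i, size_t j, size_t k, size_t nx, size_t ny)
{
    return (k * ny + j) * nx + i;
}

/* Copy a field from the k-fastest layout into the i-fastest layout. */
static void reorder_k_to_i_fastest(const double *src, double *dst,
                                   size_t nx, size_t ny, size_t nz)
{
    assert(src != NULL && dst != NULL);
    for (size_t k = 0; k < nz; ++k)
        for (size_t j = 0; j < ny; ++j)
            for (size_t i = 0; i < nx; ++i)
                dst[idx_i_fastest(i, j, k, nx, ny)] =
                    src[idx_k_fastest(i, j, k, nx, nz)];
}

/* Unmerged form: two passes over the data with a temporary array in between. */
static void update_two_loops(const double *a, const double *b, const double *c,
                             double *tmp, double *out, size_t n)
{
    for (size_t p = 0; p < n; ++p)
        tmp[p] = a[p] * b[p];
    for (size_t p = 0; p < n; ++p)
        out[p] = tmp[p] + c[p];
}

/* Merged form: one pass does both steps, so no temporary array is written or re-read. */
static void update_merged(const double *a, const double *b, const double *c,
                          double *out, size_t n)
{
    for (size_t p = 0; p < n; ++p)
        out[p] = a[p] * b[p] + c[p];
}
```

In a CUDA port, each loop in `update_two_loops` would typically become its own kernel, with `tmp` making a round trip through global memory. The merged form does more work per kernel launch and avoids that traffic. Depending on compiler settings, the merged expression may be contracted into a fused multiply-add, so the two forms can differ in the last bit for general inputs.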

External IDs: dblp:journals/ccfthpc/TanJNHW25