Kernel fusion in atomistic spin dynamics simulations on Nvidia GPUs using tensor core

Hongwei Chen

Published: 11 Jun 2024, Last Modified: 18 May 2025Journal of computational scienceEveryoneCC BY-NC-ND 4.0

Abstract: In atomistic spin dynamics simulations, the time cost of constructing the space- and time-displaced pair correlation function in real space increases {quadratically} as the number of spins $N$, leading to significant computational effort. The GEMM subroutine can be adopted to accelerate the calculation of the dynamical spin-spin correlation function, but the computational cost of simulating large spin systems ($>40000$ spins) on CPUs remains expensive. In this work, we perform the simulation on a graphics processing unit (GPU), a hardware solution widely used as an accelerator for scientific computing and deep learning. We show that GPUs can accelerate the simulation up to 25-fold compared to multi-core CPUs when using the GEMM subroutine on both. To hide memory latency, we fuse the element-wise operation into the GEMM kernel using $\mathtt{CUTLASS}$ which can improve the performance by 26\% $\sim$ 33\% compared to the implementation based on $\mathtt{cuBLAS}$. Furthermore, we perform the `on-the-fly' calculation in the epilogue of the GEMM subroutine to avoid saving intermediate results on global memory, which makes large-scale atomistic spin dynamics simulations feasible and affordable.