Abstract: The fast Fourier transform (FFT) is widely used in scientific and engineering computation. Recently developed matrix computation units for AI and high-performance computing provide new optimization opportunities for the FFT algorithm. Compared to dedicated matrix multiplication architectures such as Intel AMX, ARM’s Scalable Matrix Extension (SME) provides more flexible outer product instructions from which matrix multiplications can be constructed in software. To leverage ARM SME’s matrix multiplication capabilities, this paper proposes a novel optimized outer product pattern for the Cooley-Tukey FFT algorithm and presents OpenFFT-SME, the first FFT library for ARM SME based on this pattern. The pattern reduces both the number of outer product operations and the number of memory accesses to the DFT matrix by exploiting the symmetry and periodicity of the twiddle factors in that matrix. To further boost performance, OpenFFT-SME applies software pipelining in its assembly kernels to improve pipeline utilization, and it integrates a butterfly network designed to suit this pattern. Experiments demonstrate that OpenFFT-SME outperforms vectorization methods on ARM SME CPUs. In double precision, it achieves speedups of 3.60x (power-of-two sizes) and 4.14x (non-power-of-two sizes) over FFTW, and 2.47x (power-of-two sizes) and 3.21x (non-power-of-two sizes) over FFTE. In single precision, it achieves speedups of 4.38x (power-of-two sizes) and 7.02x (non-power-of-two sizes) over FFTW. Furthermore, we compare the advantages and disadvantages of our implementation against vectorization methods and analyze its performance characteristics through additional experiments.
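As a rough illustration of the idea the abstract describes, the following Python sketch shows how a batched DFT can be written as a sum of rank-1 outer products, and checks the standard periodicity and half-period symmetry of twiddle factors. It is a minimal model under stated assumptions, not the OpenFFT-SME kernel: the function names (`twiddle`, `dft_via_outer_products`, `naive_dft`) are hypothetical, the accumulator list merely stands in for a hardware accumulator tile, and the specific symmetry properties the paper exploits may differ from the two shown here.

```python
import cmath


def twiddle(n, k):
    """Twiddle factor w_n^k = exp(-2*pi*i*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def dft_via_outer_products(x):
    """Compute Y = F @ X for an n-by-b matrix X (b signals stored as columns).

    The product is accumulated as n rank-1 updates: column k of the DFT
    matrix F times row k of X, mirroring an outer-product-accumulate step.
    """
    n, b = len(x), len(x[0])
    acc = [[0j] * b for _ in range(n)]  # stands in for an accumulator tile
    for k in range(n):
        f_col = [twiddle(n, j * k) for j in range(n)]  # column k of F
        for j in range(n):
            for c in range(b):
                acc[j][c] += f_col[j] * x[k][c]
    return acc


def naive_dft(signal):
    """Reference DFT computed directly from the definition."""
    n = len(signal)
    return [sum(signal[k] * twiddle(n, j * k) for k in range(n))
            for j in range(n)]


if __name__ == "__main__":
    n = 8
    for k in range(n):
        # Periodicity: w_n^(k+n) == w_n^k
        assert abs(twiddle(n, k + n) - twiddle(n, k)) < 1e-12
        # Half-period symmetry: w_n^(k+n/2) == -w_n^k
        assert abs(twiddle(n, k + n // 2) + twiddle(n, k)) < 1e-12

    # Two length-4 signals stored as the columns of a 4x2 matrix.
    x = [[1 + 0j, 2 + 0j], [0j, 1j], [3 + 0j, 0j], [0j, -1 + 0j]]
    y = dft_via_outer_products(x)
    for c in range(2):
        ref = naive_dft([row[c] for row in x])
        assert all(abs(y[j][c] - ref[j]) < 1e-12 for j in range(4))
    print("outer-product DFT matches the reference")
```

Because of these two properties, only n/2 distinct twiddle values (up to sign) appear in an n-point DFT matrix, which is the kind of redundancy that can reduce loads of DFT-matrix entries and repeated outer-product work.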