Abstract: The CSB_Coo sparse matrix format is especially useful in applications such as eigenvalue problems, where efficient SPMV and transposed SPMV_T operations are required. One strategy for increasing the arithmetic intensity of large-scale parallel solvers is to use a blocked eigensolver such as LOBPCG and to operate on blocks of vectors. However, this approach is not always practical: MPI communication costs may be higher, leading to inefficiencies, or the increased memory usage of the dense vector blocks may be prohibitive. Additionally, the Lanczos algorithm is well tested in production and may be preferred in some situations. On modern architectures, vectorization is key to obtaining good performance. In this paper we show the performance optimization and benefits of vectorization with AVX-512 Conflict Detection (CD) instructions for a standard SPMV operation on a single vector. We also present a modified version of the CSB_Coo format that allows more efficient vector operations. We compare and analyze performance on Haswell, Xeon Phi (KNL and KNM), and Intel Xeon Scalable (Skylake) processors.
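To make the two operations concrete, the following is a minimal scalar sketch of SPMV and SPMV_T over a single block stored in plain coordinate (COO) form. It is an illustration under stated assumptions, not the CSB_Coo implementation described in the paper: the `coo_block` struct and the function names `coo_spmv` and `coo_spmv_t` are hypothetical, and the blocking and vectorization are omitted.

```c
#include <stddef.h>

/* Hypothetical COO storage for one sparse block (illustrative names only). */
typedef struct {
    size_t nnz;          /* number of stored nonzeros */
    const int *row;      /* row index of each nonzero */
    const int *col;      /* column index of each nonzero */
    const double *val;   /* value of each nonzero */
} coo_block;

/* y += A * x: each nonzero scatters into y at its row index. */
static void coo_spmv(const coo_block *a, const double *x, double *y) {
    for (size_t k = 0; k < a->nnz; ++k)
        y[a->row[k]] += a->val[k] * x[a->col[k]];
}

/* y += A^T * x: same storage, with the roles of row and col swapped. */
static void coo_spmv_t(const coo_block *a, const double *x, double *y) {
    for (size_t k = 0; k < a->nnz; ++k)
        y[a->col[k]] += a->val[k] * x[a->row[k]];
}
```

Both loops update `y` through an index array, so consecutive nonzeros may target the same output element. A vectorized version that processes several nonzeros per instruction has to handle such repeated indices within one vector, which is the situation the AVX-512 CD instructions are designed to detect.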