Abstract: Highlights
• Design and investigation of a vector-based reduction operation for MPI reductions.
• Implementation using Intel AVX extensions and Arm SVE to demonstrate the efficiency of our vectorized reduction operation.
• Experiments with MPI benchmarks, a performance tool, and HPC and deep learning applications.
• Experiments on different architectures (x86 and aarch64) and processors, including Intel Xeon Gold, AMD Zen 2, and Arm A64FX.