Abstract: In MPI, the collective routines MPI_Alltoall and MPI_Alltoallv play an important role in facilitating all-to-all inter-process data exchange. MPI_Alltoallv is a generalization of MPI_Alltoall that supports the exchange of non-uniform distributions of data. Popular MPI implementations, such as MPICH and OpenMPI, implement MPI_Alltoall using a combination of techniques, including the spread-out algorithm and the Bruck algorithm. The spread-out algorithm has complexity linear in the process count P, whereas the Bruck algorithm is logarithmic in P; a selection between the two is made at runtime based on the data block size. However, MPI_Alltoallv is typically implemented using only variants of the spread-out algorithm, and therefore misses out on the performance benefits that the log-time Bruck algorithm offers, especially for smaller data loads. In this paper, we first implement and empirically evaluate all existing variants of the Bruck algorithm for uniform and non-uniform data loads; this forms the basis for our own Bruck-based non-uniform all-to-all algorithms. In particular, we develop two open-source implementations, padded Bruck and two-phase Bruck, that efficiently generalize the Bruck algorithm to non-uniform all-to-all data exchange. We empirically validate the techniques on three supercomputers, Theta, Cori, and Stampede, using both microbenchmarks and two real-world applications: graph mining and program analysis. We perform weak and strong scaling studies for a range of average message sizes, degrees of imbalance, and distribution schemes, and demonstrate that our techniques outperform Cray's vendor-optimized MPI_Alltoallv by as much as 50% for some workloads and scales.
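As background for the abstract's contrast between the uniform MPI_Alltoall and the non-uniform MPI_Alltoallv, the following minimal sketch (not part of the paper; the block size and the skewed per-destination counts are illustrative assumptions) shows how the two interfaces differ: MPI_Alltoall takes a single per-destination block size, whereas MPI_Alltoallv takes per-destination counts and displacements, which callers commonly obtain by first exchanging the count arrays.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Uniform exchange: every rank sends exactly BLOCK ints to every rank. */
    enum { BLOCK = 4 };
    int *usend = malloc(sizeof(int) * BLOCK * nprocs);
    int *urecv = malloc(sizeof(int) * BLOCK * nprocs);
    for (int i = 0; i < BLOCK * nprocs; i++)
        usend[i] = rank; /* arbitrary payload */
    MPI_Alltoall(usend, BLOCK, MPI_INT, urecv, BLOCK, MPI_INT, MPI_COMM_WORLD);

    /* Non-uniform exchange: each rank sends (dest + 1) ints to destination
       dest -- a hypothetical skewed distribution chosen only for illustration. */
    int *sendcounts = malloc(sizeof(int) * nprocs);
    int *recvcounts = malloc(sizeof(int) * nprocs);
    int *sdispls    = malloc(sizeof(int) * nprocs);
    int *rdispls    = malloc(sizeof(int) * nprocs);
    for (int dest = 0; dest < nprocs; dest++)
        sendcounts[dest] = dest + 1;

    /* Each rank must learn how much it will receive; a plain MPI_Alltoall
       of the count arrays is a common way to do that. */
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, MPI_COMM_WORLD);

    int stotal = 0, rtotal = 0;
    for (int i = 0; i < nprocs; i++) {
        sdispls[i] = stotal; stotal += sendcounts[i];
        rdispls[i] = rtotal; rtotal += recvcounts[i];
    }

    int *vsend = malloc(sizeof(int) * stotal);
    int *vrecv = malloc(sizeof(int) * rtotal);
    for (int i = 0; i < stotal; i++)
        vsend[i] = rank;

    MPI_Alltoallv(vsend, sendcounts, sdispls, MPI_INT,
                  vrecv, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        printf("rank 0 received %d ints in the non-uniform exchange\n", rtotal);

    free(usend); free(urecv); free(vsend); free(vrecv);
    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    MPI_Finalize();
    return 0;
}
```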