Automatic Generation of Distributed-Memory Mappings for Tensor Computations

Martin Kong, Raneem Abu Yosef, Atanas Rountev, P. Sadayappan

Published: 01 Jan 2023, Last Modified: 13 Nov 2024, Venue: SC 2023, License: CC BY-SA 4.0

Abstract: While considerable research has been directed at automatic parallelization for shared-memory platforms, little progress has been made in automatic parallelization schemes for distributed-memory systems. We introduce an approach to automatically produce distributed-memory parallel code for an important subclass of affine tensor computations common to Coupled Cluster (CC) electronic structure methods, neuro-imaging applications, and deep learning models. We propose a systematic way of modeling the relations and trade-offs of mapping computations and data onto multidimensional grids of homogeneous nodes. Our formulation explores the space of computation and data distributions across processor grids. Tensor programs are modeled as a non-linear symbolic formulation that accounts for the volume of data communication and the per-node capacity constraints induced by specific mappings. Solutions are found iteratively with the Z3 SMT solver and are used to automatically generate efficient MPI code. Our evaluation demonstrates the effectiveness of our approach relative to Distributed-Memory Pluto and the Cyclops Tensor Framework.
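To make the idea of searching a mapping space under communication and capacity constraints concrete, the following is a minimal sketch and not the paper's formulation. It assumes a single matrix multiplication C[i,j] += A[i,k] * B[k,j] mapped onto a 3D processor grid, replaces the Z3-based search with plain exhaustive enumeration, and uses a toy cost model invented for illustration: a tile counts toward communication whenever the grid dimension it does not depend on has more than one node. All function names (`grid_shapes`, `evaluate`, `best_mapping`) are hypothetical.

```python
from math import ceil


def grid_shapes(num_procs, ndims=3):
    """Yield every ordered factorization of num_procs into ndims positive integers."""
    if ndims == 1:
        yield (num_procs,)
        return
    for d in range(1, num_procs + 1):
        if num_procs % d == 0:
            for rest in grid_shapes(num_procs // d, ndims - 1):
                yield (d,) + rest


def evaluate(shape, sizes):
    """Return (per-node footprint, per-node communication) in elements.

    ASSUMPTION: toy cost model for illustration only, not the model in the paper.
    """
    pi, pj, pk = shape          # grid extents along loop dimensions i, j, k
    ni, nj, nk = sizes          # loop extents
    ti, tj, tk = ceil(ni / pi), ceil(nj / pj), ceil(nk / pk)
    a_tile, b_tile, c_tile = ti * tk, tk * tj, ti * tj
    footprint = a_tile + b_tile + c_tile
    comm = 0
    if pj > 1:                  # A[i,k] does not depend on j: shared along j
        comm += a_tile
    if pi > 1:                  # B[k,j] does not depend on i: shared along i
        comm += b_tile
    if pk > 1:                  # C[i,j] does not depend on k: reduced along k
        comm += c_tile
    return footprint, comm


def best_mapping(num_procs, sizes, capacity):
    """Pick the grid shape with the least communication that fits in node memory.

    Returns (comm, footprint, shape), or None if no shape satisfies the capacity.
    """
    best = None
    for shape in grid_shapes(num_procs):
        footprint, comm = evaluate(shape, sizes)
        if footprint > capacity:
            continue            # violates the per-node capacity constraint
        candidate = (comm, footprint, shape)
        if best is None or candidate < best:
            best = candidate
    return best


if __name__ == "__main__":
    print(best_mapping(8, (1024, 1024, 1024), capacity=10**7))
```

Under this toy model, eight nodes and a 1024 x 1024 x 1024 problem select the balanced (2, 2, 2) grid. Exhaustive enumeration only scales to small grids and single statements; the symbolic, solver-based search described in the abstract targets the general case.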