PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System
Abstract: Modern Machine Learning (ML) training on large-scale datasets is a very time-consuming workload. It relies on the optimization algorithm Stochastic Gradient Descent (SGD) due to its effectiveness, simplicity, and generalization performance (i.e., test performance on unseen data). Processor-centric architectures (e.g., CPUs, GPUs) commonly used for modern ML training workloads based on SGD are bottlenecked by data movement between the processor and memory units due to the poor data locality in accessing large training datasets. As a result, processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads. Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck by placing the computation mechanisms inside or near memory. Several prior works propose PIM techniques to accelerate ML training; however, these works either do not consider real-world PIM systems or evaluate algorithms that are not widely used in modern ML training. Our goal is to understand the capabilities and characteristics of popular distributed SGD algorithms on real-world PIM systems to accelerate data-intensive ML training workloads. To this end, we 1) implement several representative centralized parallel SGD algorithms, i.e., algorithms based on a central node responsible for synchronization and orchestration, on the real-world general-purpose UPMEM PIM system, 2) rigorously evaluate these algorithms for ML training on large-scale datasets in terms of performance, accuracy, and scalability, 3) compare to conventional CPU and GPU baselines, and 4) discuss implications for future PIM hardware.
We highlight the need for a shift to algorithm-hardware codesign to enable decentralized parallel SGD algorithms in real-world PIM systems, which would significantly reduce communication cost and improve scalability. Our results demonstrate three major findings: 1) the general-purpose UPMEM PIM system can be a viable alternative to state-of-the-art CPUs and GPUs for many memory-bound ML training workloads, especially when operations and datatypes are natively supported by PIM hardware, 2) it is important to carefully choose the optimization algorithms that best fit PIM, and 3) the UPMEM PIM system does not scale approximately linearly with the number of nodes for many data-intensive ML training workloads. We open source all our code to facilitate future research at https://github.com/CMU-SAFARI/PIM-Opt.
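To make the class of algorithms studied concrete, the following is a minimal sketch of centralized (parameter-server-style) parallel SGD on a least-squares objective: a central node broadcasts the model, each worker computes a gradient on its local data shard (here, its full shard as one large mini-batch), and the central node averages the gradients and updates the model. All names (`local_gradient`, `centralized_sgd`) and the toy problem are illustrative assumptions, not code from the paper's artifact.

```python
import numpy as np

def local_gradient(w, X, y):
    # Worker-side step: gradient of (1/n) * ||X @ w - y||^2 on this shard.
    n = len(y)
    return (2.0 / n) * X.T @ (X @ w - y)

def centralized_sgd(shards, dim, lr=0.1, epochs=50):
    # Central node: broadcast w, gather per-worker gradients,
    # average them (synchronization point), and update the model.
    w = np.zeros(dim)
    for _ in range(epochs):
        grads = [local_gradient(w, X, y) for X, y in shards]  # "workers"
        w -= lr * np.mean(grads, axis=0)                      # aggregate + step
    return w

# Toy data: two shards of a noiseless linear problem y = X @ [1, -2].
rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0])
shards = []
for _ in range(2):
    X = rng.normal(size=(64, 2))
    shards.append((X, X @ w_true))

w = centralized_sgd(shards, dim=2)
```

The per-round gather/average at the central node is the communication cost that grows with node count; decentralized variants replace it with neighbor-to-neighbor averaging, which is why the paper argues they fit PIM systems better.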