Towards Large Scale Training on Apple Silicon

Published: 11 Jun 2025, Last Modified: 10 Jul 2025
Venue: ES-FoMo III
License: CC BY 4.0
Keywords: Apple Silicon, Distributed Training, Second-order optimizers
Abstract: Training large deep learning models is predominantly done in data centers with NVIDIA GPUs, which are unavailable to most researchers. In this paper, we explore the feasibility of training large language models (LLMs) on clusters of consumer hardware, particularly Apple devices. Compared to a cluster of NVIDIA GPUs, a cluster of Apple devices offers substantially more VRAM and unified memory, but fewer FLOPS and low inter-node bandwidth. To address these unique hardware constraints, we introduce three key innovations: (1) KPOP, an optimizer that applies Adam in the Kronecker-factored eigenbasis (KFE), enabling efficient training on each node; although it requires more VRAM than AdamW, it achieves better performance; (2) an extension of the optimizer to low-bandwidth environments based on the top eigenvalues; and (3) parallel use of the CPU and GPU, fully leveraging unified memory. We provide an extensive evaluation of the proposed methods, which in some cases even outperform widely used optimizers such as SGD and Adam in standard, non-Apple training settings. Finally, by combining these techniques, we demonstrate effective training of LLMs on clusters ranging from 2 to 16 Macs.
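To make the KFE idea concrete, below is a minimal sketch of what "Adam in the Kronecker-factored eigenbasis" could look like for a single weight matrix, assuming K-FAC-style estimation of the input- and output-side factors. The class name KfeAdamSketch, the hyperparameters, and all update details are illustrative assumptions, not the paper's actual KPOP implementation.

```python
import numpy as np

class KfeAdamSketch:
    """Illustrative sketch (not the paper's KPOP): Adam moment updates
    applied in the Kronecker-factored eigenbasis of one weight matrix."""

    def __init__(self, shape, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, ema=0.95):
        d_out, d_in = shape
        self.lr, self.beta1, self.beta2, self.eps, self.ema = lr, beta1, beta2, eps, ema
        # K-FAC-style Kronecker factors of the gradient second moment.
        self.A = np.eye(d_in)    # input-side factor (activation covariance)
        self.G = np.eye(d_out)   # output-side factor (output-gradient covariance)
        # Adam moments, maintained in the eigenbasis.
        self.m = np.zeros(shape)
        self.v = np.zeros(shape)
        self.t = 0

    def step(self, weight, grad, activations, grad_outputs):
        # 1) Update the Kronecker factors from batch statistics.
        #    activations: (batch, d_in), grad_outputs: (batch, d_out)
        batch = activations.shape[0]
        self.A = self.ema * self.A + (1 - self.ema) * activations.T @ activations / batch
        self.G = self.ema * self.G + (1 - self.ema) * grad_outputs.T @ grad_outputs / batch

        # 2) Eigendecompose the factors to obtain the KFE rotation matrices.
        _, Qa = np.linalg.eigh(self.A)
        _, Qg = np.linalg.eigh(self.G)

        # 3) Rotate the gradient into the Kronecker-factored eigenbasis.
        grad_kfe = Qg.T @ grad @ Qa

        # 4) Ordinary element-wise Adam moment updates, in the eigenbasis.
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad_kfe
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad_kfe ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        step_kfe = m_hat / (np.sqrt(v_hat) + self.eps)

        # 5) Rotate the update back to parameter space and apply it.
        return weight - self.lr * (Qg @ step_kfe @ Qa.T)
```

In such a sketch, step() would be called once per iteration with the layer's gradient and batch statistics; in practice the eigendecompositions would likely be recomputed only every few hundred steps and amortized, since they dominate the per-step cost at this layer size.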
Submission Number: 123