Presentation: In-Person
Keywords: Quantized AllReduce, Collective Acceleration, LLM, XLA
Presenter Full Name: Ibrahim Ahmed
TL;DR: This work introduces EQuARX, a native dynamic block-wise efficient quantized AllReduce within the XLA compiler for TPUs, which accelerates Gemma 3 prefill stages by up to 1.25× with negligible quality loss.
Presenter Email: ibahmed@google.com
Abstract: While Large Language Models (LLMs) have become highly influential, their enormous scale presents significant deployment challenges. Efficiently serving these models typically requires distributing them across numerous accelerator devices, which introduces substantial performance overhead from inter-device communication (collectives). While model quantization has been widely adopted to reduce the memory and compute requirements of LLM weights and activations with minimal quality impact, applying quantization directly to collectives like AllReduce is inherently difficult due to the inter-device summation involved, which can lead to numerical instability or significant error accumulation. In this work, we present a native dynamic block-wise efficient quantized AllReduce within the XLA compiler for TPUs (EQuARX). By using TPU-friendly quantization and deep pipelining of communication and compute, EQuARX with int8 precision achieves a 1.8X speedup over baseline BF16 AllReduce across various network topologies. Furthermore, EQuARX accelerates the prefill stages of Gemma 3 27B and Gemma 3 12B by 1.25X and 1.1X, respectively, with small to negligible impact on quality.
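Illustrative sketch (not the EQuARX implementation, whose block size, rounding, and pipelining details are not given here): the JAX snippet below shows the general idea of dynamic block-wise int8 quantization that could be applied to each chunk exchanged during an AllReduce, with one scale per block chosen from that block's dynamic range. All names and parameters are assumptions for illustration.

```python
# Hypothetical sketch of dynamic block-wise int8 quantization for an AllReduce
# chunk. Not the actual EQuARX code; block size and clipping are assumptions.
import jax
import jax.numpy as jnp

BLOCK = 256  # assumed block size for illustration


def quantize_blockwise(x: jax.Array):
    """Quantize a 1-D bf16 array to int8 with one scale per block."""
    blocks = x.astype(jnp.float32).reshape(-1, BLOCK)
    # Dynamic range per block: map the block's max |value| to 127.
    absmax = jnp.max(jnp.abs(blocks), axis=1, keepdims=True)
    scale = jnp.where(absmax > 0, absmax / 127.0, 1.0)
    q = jnp.clip(jnp.round(blocks / scale), -127, 127).astype(jnp.int8)
    return q, scale


def dequantize_blockwise(q: jax.Array, scale: jax.Array) -> jax.Array:
    """Recover a bf16 approximation from int8 values and per-block scales."""
    return (q.astype(jnp.float32) * scale).reshape(-1).astype(jnp.bfloat16)


# Example round trip on random activations.
x = jax.random.normal(jax.random.PRNGKey(0), (4096,), dtype=jnp.bfloat16)
q, s = quantize_blockwise(x)
x_hat = dequantize_blockwise(q, s)
```

In a quantized AllReduce, each device would dequantize the received int8 blocks, accumulate the partial sums in higher precision, and re-quantize before forwarding, which is where the numerical-stability concerns mentioned in the abstract arise.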
Presenter Bio: Dr. Ibrahim Ahmed's work lies at the intersection of hardware and software, focusing on performance optimization for machine learning. His doctoral research at the University of Toronto centered on enhancing the compute efficiency of FPGAs. He has since applied his expertise in hardware-software co-design to accelerate ML workloads running on LPUs at Groq. He is currently an XLA:TPU compiler engineer at Google focusing on optimizing performance of distributed ML on TPUs.
Paper Checklist Guidelines: I certify that all co-authors have validated the presented results and conclusions, and have read and commit to adhering to the Paper Checklist Guidelines, Call for Papers and Publication Ethics.
YouTube Link: Will add later
YouTube Link Poster: NA
Dataset Release: I certify that all co-authors commit to release the dataset and necessary scripts to reproduce the presented results.
Google Slides: Will add later
Poster: No
Workshop Registration: Yes, the presenter has registered for the workshop.
YouTube Link Short: Will add later
Submission Number: 9