Keywords: Large Language Models, Distributed Training, Tensor-Parallelism
TL;DR: Training LLMs with tensor-parallelism without fully synchronizing activations, to accelerate training and inference.
Abstract: Training and inference of Large Language Models (LLMs) with tensor-parallelism
requires substantial communication to synchronize activations. Our findings suggest
that with a few minor adjustments to current practices, LLMs can be trained
without fully synchronizing activations, reducing bandwidth demands. We name
this “Communication-Aware Architecture for Tensor-parallelism” (CAAT-Net).
We train a 7B parameter CAAT-Net model and show that tensor-parallel communication
can be reduced by up to 50% with no significant drop in pretraining accuracy
across nearly all evaluated benchmarks. We also experiment with smaller 130M
and 1.1B models to show the robustness and scalability of our method. We find that,
in some scenarios, validation loss can even improve when reducing communication.
Finally, we demonstrate how CAAT-Net accelerates both training and inference
workloads across various settings and model sizes.
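To make the core idea concrete, below is a minimal, single-process sketch of what "not fully synchronizing activations" in a row-parallel layer could look like. This is an illustrative assumption, not the paper's CAAT-Net implementation: the function `row_parallel_forward`, the `sync_fraction` parameter, and the choice of which channels get summed are all hypothetical, introduced only to show how a partial reduction compares to the standard full all-reduce of partial outputs across tensor-parallel ranks.

```python
# Hypothetical sketch of reduced activation synchronization in tensor parallelism.
# Standard row-parallel layers all-reduce (sum) the full partial outputs across
# tensor-parallel ranks; here only the first `sync_fraction` of output channels
# is summed, and the remaining channels keep each rank's local partial result.
# Simulated in one process with plain tensors so it runs without torch.distributed.
import torch

def row_parallel_forward(x, weight_shards, sync_fraction=0.5):
    """Simulate a row-parallel linear layer across len(weight_shards) 'ranks'.

    x:             input of shape (batch, d_in), split column-wise per rank
    weight_shards: list of (d_in / tp, d_out) weight shards, one per rank
    sync_fraction: fraction of output channels that are fully summed
                   (1.0 recovers the standard full all-reduce)
    """
    tp = len(weight_shards)
    x_shards = x.chunk(tp, dim=-1)                     # each rank sees a slice of x
    partials = [xs @ w for xs, w in zip(x_shards, weight_shards)]

    d_out = partials[0].shape[-1]
    n_sync = int(sync_fraction * d_out)                # channels that get "all-reduced"
    synced = torch.stack([p[..., :n_sync] for p in partials]).sum(dim=0)

    # Each rank keeps its own, unsynchronized tail channels.
    outputs = [torch.cat([synced, p[..., n_sync:]], dim=-1) for p in partials]
    return outputs                                     # one per-rank output

# Tiny usage example: tensor-parallel degree 2, half the channels synchronized.
torch.manual_seed(0)
x = torch.randn(4, 8)
shards = [torch.randn(4, 6) for _ in range(2)]
outs = row_parallel_forward(x, shards, sync_fraction=0.5)
print(outs[0].shape, torch.allclose(outs[0][:, :3], outs[1][:, :3]))  # (4, 6) True
```

In this sketch, halving `sync_fraction` halves the payload that would need to be communicated in a real distributed setting, which is the kind of bandwidth saving the abstract refers to; how CAAT-Net actually partitions and trains with partially synchronized activations is detailed in the paper itself.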
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 20813