Keywords: Transformers, Training methodology, Multi-level parallelism, Neural ODE, Parallel-in-time algorithm, BERT, Vision Transformer (ViT), Machine translation, GPT2
TL;DR: We apply a novel layer-parallel training paradigm to transformer models which scales with network depth.
Abstract: We present a new training methodology for transformers using a multilevel layer-parallel approach. Through a neural ODE formulation of transformers, our application of a multilevel parallel-in-time algorithm to the forward and back propagation phases of training achieves parallel acceleration over the layer dimension. This dramatically enhances parallel scalability as the network depth increases, which is particularly useful for large foundation models. However, the approach introduces errors that cause a systematic bias in the gradients, which in turn slows convergence near the minima. We develop algorithms that detect this critical transition and either switch to serial training or systematically increase the accuracy of layer-parallel training. Results on the BERT, GPT-2, ViT, and machine translation architectures demonstrate parallel acceleration with pre-training accuracy commensurate with serial training, while fine-tuning is unaffected.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18760