Keywords: Transformers, Training methodology, Multi-level parallelism, Neural ODE, Parallel-in-time algorithm, BERT, Vision Transformer (ViT), Machine translation, GPT2
TL;DR: We apply a novel layer-parallel training paradigm to transformer models which scales with network depth.
Abstract: We present a new training methodology for transformers using a multilevel layer-parallel approach. Through a neural ODE formulation of transformers, our application of a multilevel parallel-in-time algorithm to the forward and back propagation phases of training achieves parallel acceleration over the layer dimension. This dramatically enhances parallel scalability as the network depth increases, which is particularly useful for large foundation models. However, the approach introduces errors that cause a systematic bias in the gradients, which in turn slows convergence near the minima. We develop algorithms that detect this critical transition and either switch to serial training or systematically increase the accuracy of layer-parallel training. Results on the BERT, GPT-2, ViT, and machine translation architectures demonstrate parallel acceleration with pre-training accuracy commensurate with serial training, while fine-tuning is unaffected.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18760