System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Published: 01 Jan 2024 · Last Modified: 01 Oct 2024 · IPDPS (Workshops) 2024 · CC BY-SA 4.0
Abstract: Long sequences are ubiquitous in NLP tasks such as document summarization, machine translation, and dialogue modeling [1]–[9]. Traditional approaches to parallelism, including data parallelism [10]–[12], tensor parallelism [13], and pipeline parallelism [14]–[16], struggle to handle sequences that span thousands or even millions of tokens.
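As a back-of-envelope illustration (not taken from the paper) of why million-token sequences are hard, the sketch below estimates the memory needed to materialize the L × L attention-score matrix of naive self-attention. The function name and the fp16/per-head/per-sequence assumptions are hypothetical, chosen only to make the scaling concrete.

```python
# Back-of-envelope sketch (assumption, not from the paper): memory to
# materialize the L x L attention-score matrix of naive self-attention,
# per head and per sequence, at fp16 (2 bytes per element).
def attn_score_bytes(seq_len: int, bytes_per_elem: int = 2) -> int:
    # Naive attention stores softmax(QK^T) as a seq_len x seq_len matrix.
    return seq_len * seq_len * bytes_per_elem

for L in (4_096, 65_536, 1_048_576):
    gib = attn_score_bytes(L) / 2**30
    print(f"L = {L:>9,}: {gib:10,.3f} GiB per head per sequence")
```

At one million tokens, this single score matrix already reaches roughly 2 TiB per head, which hints at why strategies that partition the batch (data parallelism), the hidden dimension (tensor parallelism), or the layer stack (pipeline parallelism) leave the sequence-length bottleneck unaddressed.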