Keywords: Decentralized Training, Federated Learning, Multi-domain Training, Multilingual Training
TL;DR: We propose DEPT, a pre-training framework that decouples embedding layers from the transformer body, enabling robust training on heterogeneous data, improving generalization, and reducing memory footprint by up to 80%.
Abstract: Past works have shown that lexical, syntactic, and semantic differences across heterogeneous data sources can cause challenges such as negative interference or the "curse of multilinguality". Because of this, training on such heterogeneous corpora requires extensive and costly efforts to balance data mixtures. We propose a novel pre-training framework to alleviate this curse. Our method, DEPT, decouples embeddings from the transformer body while simultaneously training the latter in multiple contexts without a shared global vocabulary. DEPT: (1) trains robustly and effectively under significant data heterogeneity, (2) reduces token embedding parameters by up to 80% and communication costs by 714x for billion-scale models, (3) enhances transformer body plasticity and generalization, improving average perplexity by upwards of 15.3-20% and boosting downstream fine-tuning performance in our experiments, and (4) permits training with custom, optimized vocabularies per data source. We demonstrate DEPT's potential via the first vocabulary-agnostic federated multilingual pre-training of a billion-scale model, reducing total parameters by 24% versus standard training.
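To make the decoupling idea in the abstract concrete, the following is a minimal sketch (not the authors' implementation): each data source keeps its own token embedding table and output head sized to its own vocabulary, while a single transformer body is shared across sources. All names, vocabulary sizes, and dimensions here are illustrative assumptions; causal masking, the loss, and the federated training loop are omitted.

```python
import torch
import torch.nn as nn


class DecoupledLM(nn.Module):
    """Hypothetical sketch: per-source embeddings/heads around a shared transformer body."""

    def __init__(self, vocab_sizes: dict[str, int], d_model: int = 256,
                 n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        # One embedding table and one output head per data source
        # (no shared global vocabulary across sources).
        self.embeddings = nn.ModuleDict(
            {src: nn.Embedding(v, d_model) for src, v in vocab_sizes.items()}
        )
        self.heads = nn.ModuleDict(
            {src: nn.Linear(d_model, v) for src, v in vocab_sizes.items()}
        )
        # Shared transformer body trained across all sources.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.body = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids: torch.Tensor, source: str) -> torch.Tensor:
        hidden = self.body(self.embeddings[source](token_ids))
        # Logits over the source-specific vocabulary.
        return self.heads[source](hidden)


# Example: two sources with differently sized, independently optimized vocabularies.
model = DecoupledLM({"en_news": 8000, "code": 12000})
logits = model(torch.randint(0, 8000, (2, 16)), source="en_news")
print(logits.shape)  # torch.Size([2, 16, 8000])
```

Under this reading, only the shared body needs to be synchronized across data sources, which is consistent with the reported reductions in embedding parameters and communication costs.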
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11135