Keywords: Decentralized Training, Federated Learning, Multi-domain Training, Multilingual Training
TL;DR: We propose DEPT, a pre-training framework that decouples embedding layers from the transformer body, enabling robust training on heterogeneous data, improving generalization, and reducing memory footprint by up to 80%.
Abstract: Past works have shown that lexical, syntactic, and semantic differences across heterogeneous data sources can cause challenges such as negative interference or the "curse of multilinguality". Because of this, training on such heterogeneous corpora requires extensive and costly efforts to balance data mixtures. We propose a novel pre-training framework to alleviate this curse. Our method, DEPT, decouples embeddings from the transformer body while simultaneously training the latter in multiple contexts without a shared global vocabulary. DEPT: (1) trains robustly and effectively under significant data heterogeneity, (2) reduces token embedding parameters by up to 80% and communication costs by 714x for billion-scale models, (3) enhances transformer body plasticity and generalization, improving average perplexity by upwards of 15.3-20% and boosting downstream fine-tuning performance in our experiments, and (4) permits training with custom, optimized vocabularies per data source. We demonstrate DEPT's potential via the first vocabulary-agnostic federated multilingual pre-training of a billion-scale model, reducing total parameters by 24% versus standard training.
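To make the decoupling idea in the abstract concrete, the following is a minimal sketch (not the authors' implementation): each data source keeps its own token embedding table and output head sized to its own vocabulary, while a single transformer body is shared across sources. All names, vocabulary sizes, and dimensions here are illustrative assumptions; causal masking, the loss, and the federated training loop are omitted.

```python
import torch
import torch.nn as nn


class DecoupledLM(nn.Module):
    """Hypothetical sketch: per-source embeddings/heads around a shared transformer body."""

    def __init__(self, vocab_sizes: dict[str, int], d_model: int = 256,
                 n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        # One embedding table and one output head per data source
        # (no shared global vocabulary across sources).
        self.embeddings = nn.ModuleDict(
            {src: nn.Embedding(v, d_model) for src, v in vocab_sizes.items()}
        )
        self.heads = nn.ModuleDict(
            {src: nn.Linear(d_model, v) for src, v in vocab_sizes.items()}
        )
        # Shared transformer body trained across all sources.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.body = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids: torch.Tensor, source: str) -> torch.Tensor:
        hidden = self.body(self.embeddings[source](token_ids))
        # Logits over the source-specific vocabulary.
        return self.heads[source](hidden)


# Example: two sources with differently sized, independently optimized vocabularies.
model = DecoupledLM({"en_news": 8000, "code": 12000})
logits = model(torch.randint(0, 8000, (2, 16)), source="en_news")
print(logits.shape)  # torch.Size([2, 16, 8000])
```

Under this reading, only the shared body needs to be synchronized across data sources, which is consistent with the reported reductions in embedding parameters and communication costs.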
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11135