Worldwide Federated Training of Language Models

Published: 01 Oct 2024, Last Modified: 17 Oct 2024 · FL@FM-NeurIPS'24 Oral · CC0 1.0
Keywords: Federated Learning, Distributed Training, Language Modeling, Natural Language Processing
TL;DR: WorldLM is an LM training system based on federations of federations, enabling actors with varying regulatory, privacy, and security concerns to collaborate. It accounts for data heterogeneity via attention-based aggregation and residual embeddings.
Abstract: Centralized language model (LM) training requires vast datasets, raising legal, ethical, and practical concerns. Federated learning (FL) offers an alternative by enabling organizations to collaboratively leverage untapped data while minimizing data movement. However, scaling FL globally introduces challenges such as legal, privacy, and statistical heterogeneity in language data. We propose Worldwide Federated Training of Language Models (WorldLM), a system that builds federations of federations to tackle these issues. WorldLM enables each federation to autonomously meet jurisdictional or competitive constraints, while managing statistical heterogeneity through attention-based aggregation of key layers and cross-federation information sharing via residual embeddings. WorldLM outperforms standard FL by up to 1.91x in terms of perplexity, Hierarchical Federated Averaging (HierFAVG) by 1.86x, and Federated Learning with Personalized Layers (FedPer) by 3.3x. WorldLM scales to models with 400M parameters, achieving 1.39x lower perplexity than centralized counterparts while retaining compute efficiency, and it approaches the performance of perfectly localized models trained in an infinite-data regime. Additionally, under differential privacy constraints, WorldLM proves highly resilient compared to standard FL methods, which diverge. These results establish WorldLM as an effective means for federated pre-training across geographic and legal boundaries.
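For intuition, a minimal sketch of one plausible form of attention-based aggregation over a single key layer is shown below (PyTorch). The function name, the distance-based scoring, and the `temperature` parameter are illustrative assumptions, not the paper's exact method.

```python
import torch
import torch.nn.functional as F

def attention_aggregate(local_params: torch.Tensor,
                        peer_params: list[torch.Tensor],
                        temperature: float = 1.0) -> torch.Tensor:
    """Aggregate one key layer's flattened parameters across federations.

    Each candidate (local plus peers) is weighted by the softmax of its
    negative, temperature-scaled distance to the local parameters, so
    statistically similar federations contribute more to the merged layer.
    All names here are hypothetical, not the paper's API.
    """
    candidates = torch.stack([local_params] + list(peer_params))          # (k+1, d)
    # Similarity score: negative Euclidean distance to the local parameters.
    scores = -torch.linalg.vector_norm(candidates - local_params, dim=1) / temperature
    weights = F.softmax(scores, dim=0)                                    # (k+1,)
    # Convex combination of the candidate parameter vectors.
    return (weights.unsqueeze(1) * candidates).sum(dim=0)

# Usage: merge a flattened key layer with contributions from two peer federations.
local = torch.randn(1024)
peers = [torch.randn(1024), torch.randn(1024)]
merged = attention_aggregate(local, peers)
```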
Submission Number: 54