['1c1', '< Title: DEPT: DECOUPLED EMBEDDINGS FOR PRE-TRAINING LANGUAGE MODELS', '---', '> Title: DEPT: Decoupled Embeddings for Robust and Efficient Language Model Pre-training', '3c3', '< Abstract: Language Model pre-training uses broad data mixtures to enhance performance across domains and languages. However, training on such heterogeneous text corpora requires extensive and expensive efforts. Since these data sources vary significantly in lexical, syntactic, and semantic aspects, they cause negative interference or the "curse of multilinguality". To address these challenges we propose a communication-efficient pre-training framework, DEPT. Our method decouples embeddings from the transformer body while simultaneously training the latter on multiple data sources without requiring a shared vocabulary. DEPT can: (1) train robustly and effectively under significant data heterogeneity, (2) minimize token embedding parameters to only what the data source vocabulary requires, while cutting communication costs in direct proportion to both the communication frequency and the reduction in parameters, (3) enhance transformer body plasticity and generalization, improving both average perplexity (up to 20%) and downstream task performance, and (4) enable training with custom optimized vocabularies per data source. We demonstrate DEPT\'s potential via the first vocabularyagnostic federated pre-training of billion-scale models, reducing communication costs by orders of magnitude and embedding memory by 4 -5×.', '---', '> Abstract: Language Model (LM) pre-training leverages diverse data mixtures to enhance generalization across domains and languages. However, training on such heterogeneous text corpora is resource-intensive and often leads to negative interference or the "curse of multilinguality" due to significant variations in lexical, syntactic, and semantic properties. 
To address these critical challenges, we introduce DEPT, a novel and communication-efficient pre-training framework. DEPT decouples embeddings from the transformer body, allowing the latter to be trained simultaneously on multiple data sources without requiring a shared vocabulary. This approach offers four key advantages: (1) it enables robust and effective training amidst substantial data heterogeneity, (2) it minimizes token embedding parameters to only what each data source vocabulary requires, drastically reducing communication costs, (3) it enhances transformer body plasticity and generalization, leading to improved average perplexity (up to 20%) and superior downstream task performance, and (4) it facilitates training with custom, optimized vocabularies per data source. We demonstrate DEPT\'s potential through the first vocabulary-agnostic federated pre-training of billion-scale models, achieving orders of magnitude reduction in communication costs and a 4-5× reduction in embedding memory.', '7,17c7,24', '< Existing methods for pre-training on heterogeneous data are costly and complex. Multilingual models like BERT (Devlin et al., 2019), XLM (Conneau et al., 2020), and mT5 (Xue et al., 2021) require temperature-tuning of language sampling ratios for each model-tokenizer pair, involving expensive model selection to optimize perplexity (Conneau et al., 2020). Large Language Models (LLMs) such as LLaMA handle heterogeneous data with intensive "language-specific heuristics and modelbased filters" (Dubey et al., 2024). However, these methods still face challenges such as vocabulary dilution (Rust et al., 2021) and sub-optimal cross-lingual/domain performance (Chang et al., 2023a).', '< This paper proposes a communication-efficient pre-training pipeline to address heterogeneous data challenges. 
Observing that custom vocabularies boost performance across languages (Rust et al., 2021) and domains (McLeish et al., 2024), we propose partially or fully decoupling the embedding space from transformer bodies. This approach optimizes embeddings for specific data sources while the transformer learns abstract representations. We introduce Decoupled Embeddings for Pre-Training (DEPT) in three variants, GLOB, TRIM, and SPEC (see Fig. 1), each increasingly leveraging (2) corpora are tokenized into a pre-tokenized dataset;', '< (3) WORKERS train the model on their pre-tokenized data; (4) partial training results are collected;', '< (5) results are aggregated; (6) the new model is sent to WORKERS. Steps 3-6 repeat to convergence.', '< specialized representations to allow pre-training with distinct domains/languages, embedding matrices, and vocabularies. For example, our SPEC variant scales the vocabulary size linearly with the number of data sources without increasing memory requirements.', '< DEPT enables pre-training on heterogeneous data sources with unique vocabularies and linguistic features. In the DEPT pipeline, data sources are isolated as silos, akin to clients in cross-silo Federated Learning (FL) (McMahan et al., 2017b). DEPT trains on each silo and aggregates contributions like FL clients. This work examines whether an LM can converge on data mixtures without a shared (1) output vocabulary, (2) embedding matrices, or (3) tokenization.', '< Algorithm 1 Decoupled Embedding for Pre-Training (DEPT) variants: GLOB TRIM SPEC Require: S: set of K data sources, T : number of rounds Require: θ0: initial transformer blocks, ϕ0, ψ0: optional token/positional embeddings Require: {D k } K k=1 : source-specific datasets, {V k } K k=1 : source-specific vocabularies Require: InnerOPT: inner optimizer, OuterOPT: outer optimizer, e.g., AdamW and FedAvg 1: for each update round t = 1, 2, . . . 
, T do 2: Randomly select a subset St ⊆ S of data sources for round t 3:', '< for each data source k ∈ St in parallel do 4: θ k t , ϕ k t , ψ k t ← InnerOPT(θt-1, ϕt-1, ψt-1, D k ) ▷ GLOB: Global embeddings 5:', '< ϕt-1|V k = Trim(ϕt-1, V k ) ▷ TRIM: Trim global token embeddings 6: Our method, DEPT, achieves this decoupling by: (1) tokenizing data sources independently, using a global or custom vocabulary;', '< θ k t , ϕt|V k , ψ k t ← InnerOPT(θt-1, ϕt-1|V k , ψt-1, D k ) ▷ TRIM 7: θ k t , ϕ k t , ψ k t ← InnerOPT(θt-1, ϕ k t-1 , ψ k t-', '< (2) randomly initializing LM parameters; and (3) training iteratively over random source subsets (see Section 2). This contrasts with standard pre-training, which uses shared embeddings and draws random samples from a distribution of all sources.', '---', '> Existing methods for pre-training on heterogeneous data are often resource-intensive and complex. Multilingual models like BERT (Devlin et al., 2019), XLM (Conneau et al., 2020), and mT5 (Xue et al., 2021) necessitate meticulous temperature-tuning of language sampling ratios for each model-tokenizer pair, demanding expensive model selection to optimize perplexity (Conneau et al., 2020). Similarly, Large Language Models (LLMs) such as LLaMA address heterogeneous data through intensive "language-specific heuristics and model-based filters" (Dubey et al., 2024). Despite these efforts, these conventional methods still grapple with critical issues such as vocabulary dilution (Rust et al., 2021) and sub-optimal cross-lingual/domain performance (Chang et al., 2023a).', '> ', '> This paper introduces DEPT (Decoupled Embeddings for Pre-Training), a novel communication-efficient pre-training pipeline designed to overcome these pervasive challenges. Our core insight is that custom vocabularies significantly boost performance across languages (Rust et al., 2021) and domains (McLeish et al., 2024). 
Building on this, we propose partially or fully decoupling the embedding space from the transformer body. This allows for optimizing embeddings for specific data sources while the transformer learns more abstract and generalizable representations. We present DEPT in three distinct variants: GLOB, TRIM, and SPEC (see Fig. 1). Each variant progressively leverages specialized representations to enable pre-training with diverse domains, languages, embedding matrices, and vocabularies. For instance, our SPEC variant can scale vocabulary size linearly with the number of data sources without increasing overall memory requirements.', '> ', '> DEPT facilitates pre-training on heterogeneous data sources with unique vocabularies and linguistic features. Within the DEPT pipeline, data sources are treated as isolated silos, analogous to clients in cross-silo Federated Learning (FL) (McMahan et al., 2017b). DEPT trains on each silo and aggregates contributions similar to FL clients. This work rigorously investigates whether an LM can achieve convergence on data mixtures without relying on a shared (1) output vocabulary, (2) embedding matrices, or (3) tokenization. Our method, DEPT, achieves this decoupling by: (1) tokenizing data sources independently, using either a global or custom vocabulary; (2) randomly initializing LM parameters; and (3) training iteratively over random source subsets (see Section 2). 
This fundamentally contrasts with standard pre-training, which employs shared embeddings and draws random samples from a monolithic distribution of all sources.', '> ', '> Algorithm 1 Decoupled Embedding for Pre-Training (DEPT) variants: GLOB TRIM SPEC', '> Require: S: set of K data sources, T: number of rounds', '> Require: θ_0: initial transformer blocks, ϕ_0, ψ_0: optional token/positional embeddings', '> Require: {D_k}_{k=1}^K: source-specific datasets, {V_k}_{k=1}^K: source-specific vocabularies', '> Require: InnerOPT: inner optimizer, OuterOPT: outer optimizer, e.g., AdamW and FedAvg', '> 1: for each update round t = 1, 2, ..., T do', '> 2: Randomly select a subset S_t ⊆ S of data sources for round t', '> 3: for each data source k ∈ S_t in parallel do', '> 4: θ^k_t, ϕ^k_t, ψ^k_t ← InnerOPT(θ_{t-1}, ϕ_{t-1}, ψ_{t-1}, D_k) ▷ GLOB: Global embeddings', '> 5: ϕ_{t-1}|_{V_k} = Trim(ϕ_{t-1}, V_k) ▷ TRIM: Trim global token embeddings', '> 6: θ^k_t, ϕ_t|_{V_k}, ψ^k_t ← InnerOPT(θ_{t-1}, ϕ_{t-1}|_{V_k}, ψ_{t-1}, D_k) ▷ TRIM', '> 7: θ^k_t, ϕ^k_t, ψ^k_t ← InnerOPT(θ_{t-1}, ϕ^k_{t-1}, ψ^k_{t-', '399d405', '< ']
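Steps 5 and 6 of Algorithm 1 (the TRIM variant) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `trim`, `inner_opt`, and `trim_round` are hypothetical, the transformer body θ is stood in for by a flat vector, `inner_opt` is a noise-adding placeholder for real AdamW training on D_k, and the per-row averaging of trimmed embeddings is an assumed FedAvg-style aggregation that the excerpt above does not spell out. Positional embeddings ψ are omitted for brevity.

```python
import numpy as np


def trim(phi, vocab_ids):
    """Step 5: keep only the global embedding rows that source k's vocabulary uses."""
    return phi[vocab_ids].copy()


def inner_opt(theta, phi_local, rng):
    """Placeholder for InnerOPT (assumption): real training would run AdamW on D_k."""
    new_theta = theta + 0.01 * rng.standard_normal(theta.shape)
    new_phi = phi_local + 0.01 * rng.standard_normal(phi_local.shape)
    return new_theta, new_phi


def trim_round(theta, phi, vocabs, selected, rng):
    """One TRIM round: trim, train locally, then average (assumed FedAvg-style)."""
    theta_sum = np.zeros_like(theta)
    phi_sum = np.zeros_like(phi)
    counts = np.zeros(phi.shape[0])
    for k in selected:
        ids = vocabs[k]  # token ids must be unique within a vocabulary
        theta_k, phi_k = inner_opt(theta, trim(phi, ids), rng)  # step 6
        theta_sum += theta_k
        phi_sum[ids] += phi_k
        counts[ids] += 1
    # Body parameters: plain average over the selected sources.
    new_theta = theta_sum / len(selected)
    # Embedding rows: average over the sources that trained them; others stay as-is.
    new_phi = phi.copy()
    seen = counts > 0
    new_phi[seen] = phi_sum[seen] / counts[seen][:, None]
    return new_theta, new_phi


rng = np.random.default_rng(0)
theta = np.zeros(4)                  # stand-in for transformer body parameters
phi = np.zeros((6, 3))               # global token embeddings: 6 tokens, dim 3
vocabs = {0: np.array([0, 1, 2]), 1: np.array([2, 3])}
new_theta, new_phi = trim_round(theta, phi, vocabs, selected=[0, 1], rng=rng)
```

In this toy run each source only ever holds the embedding rows in its own vocabulary (3 and 2 rows instead of 6), which is the memory and communication saving the TRIM variant targets; rows 4 and 5, used by no selected source, are left unchanged.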
