Title: DEPT: Decoupled Embeddings for Robust and Efficient Language Model Pre-training

Abstract: Language Model (LM) pre-training leverages diverse data mixtures to enhance generalization across domains and languages. However, training on such heterogeneous text corpora is resource-intensive and often leads to negative interference or the "curse of multilinguality" due to significant variations in lexical, syntactic, and semantic properties. To address these critical challenges, we introduce DEPT, a novel and communication-efficient pre-training framework. DEPT decouples embeddings from the transformer body, allowing the latter to be trained simultaneously on multiple data sources without requiring a shared vocabulary. This approach offers four key advantages: (1) it enables robust and effective training amidst substantial data heterogeneity, (2) it minimizes token embedding parameters to only what each data source vocabulary requires, drastically reducing communication costs, (3) it enhances transformer body plasticity and generalization, leading to improved average perplexity (up to 20%) and superior downstream task performance, and (4) it facilitates training with custom, optimized vocabularies per data source. We demonstrate DEPT's potential through the first vocabulary-agnostic federated pre-training of billion-scale models, achieving orders of magnitude reduction in communication costs and a 4-5× reduction in embedding memory.

Section: INTRODUCTION
Language models (LMs) rely on sizable pre-training datasets to generalize across tasks (Radford et al., 2019; Brown et al., 2020) and languages (Pires et al., 2019; Artetxe et al., 2020; Zhao et al., 2024). More data boosts generalization and language acquisition (Hoffmann et al., 2022). However, scaling data creates a heterogeneous mix of data sources (different domains and languages) that challenges LMs. Two issues persist: Negative Interference (Wang et al., 2020), where diverse sources compete for capacity, and the Curse of Multilinguality (Conneau et al., 2020), where adding languages yields diminishing returns, especially on low-resource languages (Magueresse et al., 2020).
Existing methods for pre-training on heterogeneous data are often resource-intensive and complex. Multilingual models like BERT (Devlin et al., 2019), XLM (Conneau et al., 2020), and mT5 (Xue et al., 2021) necessitate meticulous temperature-tuning of language sampling ratios for each model-tokenizer pair, demanding expensive model selection to optimize perplexity (Conneau et al., 2020). Similarly, Large Language Models (LLMs) such as LLaMA address heterogeneous data through intensive "language-specific heuristics and model-based filters" (Dubey et al., 2024). Despite these efforts, these conventional methods still grapple with critical issues such as vocabulary dilution (Rust et al., 2021) and sub-optimal cross-lingual/domain performance (Chang et al., 2023a).

This paper introduces DEPT (Decoupled Embeddings for Pre-Training), a novel communication-efficient pre-training pipeline designed to overcome these pervasive challenges. Our core insight is that custom vocabularies significantly boost performance across languages (Rust et al., 2021) and domains (McLeish et al., 2024). Building on this, we propose partially or fully decoupling the embedding space from the transformer body. This allows for optimizing embeddings for specific data sources while the transformer learns more abstract and generalizable representations. We present DEPT in three distinct variants: GLOB, TRIM, and SPEC (see Fig. 1). Each variant progressively leverages specialized representations to enable pre-training with diverse domains, languages, embedding matrices, and vocabularies. For instance, our SPEC variant can scale vocabulary size linearly with the number of data sources without increasing overall memory requirements.

DEPT facilitates pre-training on heterogeneous data sources with unique vocabularies and linguistic features. Within the DEPT pipeline, data sources are treated as isolated silos, analogous to clients in cross-silo Federated Learning (FL) (McMahan et al., 2017b). DEPT trains on each silo and aggregates contributions similar to FL clients. This work rigorously investigates whether an LM can achieve convergence on data mixtures without relying on a shared (1) output vocabulary, (2) embedding matrices, or (3) tokenization. Our method, DEPT, achieves this decoupling by: (1) tokenizing data sources independently, using either a global or custom vocabulary; (2) randomly initializing LM parameters; and (3) training iteratively over random source subsets (see Section 2). This fundamentally contrasts with standard pre-training, which employs shared embeddings and draws random samples from a monolithic distribution of all sources.

Algorithm 1 Decoupled Embedding for Pre-Training (DEPT) variants: GLOB, TRIM, SPEC
Require: S: set of K data sources, T: number of rounds
Require: θ_0: initial transformer blocks; ϕ_0, ψ_0: optional token/positional embeddings
Require: {D_k}_{k=1}^K: source-specific datasets; {V_k}_{k=1}^K: source-specific vocabularies
Require: InnerOPT: inner optimizer, OuterOPT: outer optimizer, e.g., AdamW and FedAvg
1: for each update round t = 1, 2, ..., T do
2:   Randomly select a subset S_t ⊆ S of data sources for round t
3:   for each data source k ∈ S_t in parallel do
4:     θ_t^k, ϕ_t^k, ψ_t^k ← InnerOPT(θ_{t-1}, ϕ_{t-1}, ψ_{t-1}, D_k) ▷ GLOB: Global embeddings
5:     ϕ_{t-1}|_{V_k} = Trim(ϕ_{t-1}, V_k) ▷ TRIM: Trim global token embeddings
6:     θ_t^k, ϕ_t|_{V_k}, ψ_t^k ← InnerOPT(θ_{t-1}, ϕ_{t-1}|_{V_k}, ψ_{t-1}, D_k) ▷ TRIM
7:     θ_t^k, ϕ_t^k, ψ_t^k ← InnerOPT(θ_{t-1}, ϕ_{t-1}^k, ψ_{t-1}^k, …

Section: METHOD
Akin to federated and meta-learning, DEPT optimizes a global parameter set θ (the transformer body) along with optional embeddings ϕ, ψ across data sources S. It trains iteratively by selecting a subset $S_t \subseteq S$ each round $t$. For each data source $k \in S_t$, DEPT independently performs inner-loop optimization (InnerOPT, e.g., SGD) and then aggregates the transformer bodies using an outer-loop optimizer (OuterOPT, e.g., FedAvg). We present three variants for managing ϕ and ψ, offering progressively stronger specialization, and compare them in Section 2.4.
GLOB Shared Embeddings: Based on FedAvg-like methods, GLOB sends a global transformer and embeddings to each data source, which then trains locally. The updated models are aggregated via OuterOPT, making GLOB suitable for federated and centralized settings.
TRIM Partially-decoupled: Each data source gets a global transformer and embeddings but trims the token embeddings to its local vocabulary $V_k$, reducing the input/output space. During OuterOPT aggregation, trimmed embeddings are projected to the global vocabulary.
SPEC Fully-decoupled: Each data source gets a global transformer and, when first sampled, randomly initializes specialized token/position embeddings. These remain local (never aggregated), supporting any vocabulary, including those from specialized tokenizers.
DEPT replaces the standard pre-training pipeline (Fig. 1) for broad pre-training before adaptation (Dubey et al., 2024). Algorithm 1 runs in parallel, scales with hardware, and reduces communication. Reduced communication makes it ideal for low-bandwidth settings like cross-silo FL.
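The round structure described above can be illustrated with a minimal sketch in the style of SPEC. It is not the authors' implementation: the quadratic toy objective, the `inner_opt` and `spec_round` helpers, the separate treatment of body and embeddings, and the three toy sources are all assumptions made for illustration. The body is averaged across sampled sources, while each source's embeddings are created on first sampling and never aggregated.

```python
import numpy as np


def inner_opt(params, target, lr=0.5, steps=4):
    """Toy stand-in for InnerOPT: gradient steps on 0.5 * ||params - target||^2."""
    out = params.copy()
    for _ in range(steps):
        out = out - lr * (out - target)
    return out


def spec_round(theta, local_emb, sources, sampled, rng):
    """One SPEC-style round: train locally, then average only the body."""
    bodies = []
    for k in sampled:
        src = sources[k]
        if k not in local_emb:
            # First time source k is sampled: randomly initialise its embeddings.
            local_emb[k] = rng.normal(size=(src["vocab_size"], theta.shape[0]))
        bodies.append(inner_opt(theta, src["body_target"]))
        # Embeddings stay with the source and are never aggregated.
        local_emb[k] = inner_opt(local_emb[k], src["emb_target"])
    # OuterOPT as plain parameter averaging over the sampled sources.
    return np.mean(bodies, axis=0)


# Hypothetical sources with different vocabulary sizes (illustration only).
d_model = 4
sources = {
    name: {
        "vocab_size": v,
        "body_target": np.full(d_model, b),
        "emb_target": np.zeros((v, d_model)),
    }
    for name, v, b in [("a", 3, 1.0), ("b", 5, -1.0), ("c", 7, 0.5)]
}

rng = np.random.default_rng(0)
theta, local_emb = np.zeros(d_model), {}
for t in range(3):
    # Randomly select a subset S_t of the data sources for this round.
    sampled = [str(k) for k in rng.choice(sorted(sources), size=2, replace=False)]
    theta = spec_round(theta, local_emb, sources, sampled, rng)
```

Because only `theta` crosses the source boundary, the per-source embedding shapes are free to differ, which is the property that makes the fully decoupled variant vocabulary-agnostic.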

Section: TRIMMED EMBEDDING AGGREGATION (TRIM)
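For data source $k$, the trimmed embeddings $\phi_k \in \mathbb{R}^{|V_k| \times d_{model}}$ hold only the rows of the local vocabulary $V_k$. We use the matrix $I_k^\top \in \mathbb{R}^{|V| \times |V_k|}$ to project $\phi_k$ back to the global vocabulary, $\tilde{\phi}_k = I_k^\top \phi_k$, which zero-pads the rows of tokens outside $V_k$.
Aggregation (OuterOPT) is then applied to $\{\tilde{\phi}_k\}_{k \in S_t}$ with zero-padding ignored to avoid interference between tokens not shared across sources.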
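A small NumPy sketch of this trim, project, and aggregate cycle follows. It assumes OuterOPT is plain averaging and that rows not covered by any sampled source keep their previous value; both are assumptions for illustration, and the helper names are hypothetical.

```python
import numpy as np


def selection_matrix(vocab_ids, global_vocab_size):
    """Build I_k of shape (|V_k|, |V|): row i selects global token vocab_ids[i]."""
    I_k = np.zeros((len(vocab_ids), global_vocab_size))
    I_k[np.arange(len(vocab_ids)), vocab_ids] = 1.0
    return I_k


def trim(phi, vocab_ids):
    """Trim global embeddings (|V|, d) to the local vocabulary: I_k @ phi."""
    return selection_matrix(vocab_ids, phi.shape[0]) @ phi


def aggregate_trimmed(phi_prev, local_embs, local_vocabs):
    """Average projected embeddings, ignoring zero-padded rows."""
    V = phi_prev.shape[0]
    total = np.zeros_like(phi_prev)
    counts = np.zeros(V)
    for phi_k, ids in zip(local_embs, local_vocabs):
        I_k = selection_matrix(ids, V)
        total += I_k.T @ phi_k        # zero-padded projection to the global vocabulary
        counts += I_k.sum(axis=0)     # 1 for every global row that source k holds
    seen = counts > 0
    out = phi_prev.copy()             # assumption: unseen rows are left untouched
    out[seen] = total[seen] / counts[seen, None]
    return out


phi_prev = np.arange(10, dtype=float).reshape(5, 2)   # |V| = 5, d_model = 2
ids_a, ids_b = np.array([0, 1]), np.array([1, 3])     # token 1 is shared
phi_a = trim(phi_prev, ids_a) + 1.0                   # pretend local updates
phi_b = trim(phi_prev, ids_b) + 3.0
phi_new = aggregate_trimmed(phi_prev, [phi_a, phi_b], [ids_a, ids_b])
```

Dividing by per-row counts rather than by the number of sampled sources is what keeps the zero-padding from dragging down tokens that only one source holds.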

Section: POSITIONAL EMBEDDING SPECIALIZATION (SPEC)
Unlike other variants, SPEC specializes both token embeddings ϕ and positional embeddings ψ, as evidence shows syntactic order-dependent properties matter more than subword sharing (Pires et al., 2019). Thus, SPEC is agnostic to vocabulary and sequence length, enabling federated learning without shared tokenization. Without positional specialization, SPEC resembles TRIM, but with the embedding matrix split across sources and disjoint vocabularies $\{V_k\}_{k=1}^{K}$ such that $V = \bigcup_{k=1}^{K} V_k$.

Section: VARIANT CHARACTERISTICS
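Method | Memory | Communication (per step) | Vocabulary-agnostic
STD | O(M) | O(M) | ×
GLOB | O(M) | O(M / N_local) | ×
TRIM | O(M - (|V| - |V_k|) d_model) | O((M - (|V| - |V_k|) d_model) / N_local) | ×
SPEC | O(M - (|V| - |V_k|) d_model) | O((M - (|V| + L) d_model) / N_local) | ✓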
In most scenarios, practitioners can deploy any of our proposals, obtaining reduced communication and memory costs as shown in Table 1. However, some settings are appropriate for a given variant.
GLOB resembles a standard pre-training pipeline. Although it does not explicitly decouple embeddings from the transformer, they decouple over the course of an inner-loop iteration since only local tokens influence them. As a communication-efficient form of SGD, GLOB reduces communication costs compared to distributed algorithms such as DDP (Li et al., 2020) or FSDP (Rajbhandari et al., 2020), which synchronize gradients at every step. However, constructing a global vocabulary requires sufficient knowledge of the dataset and may risk vocabulary dilution and capacity contention.
TRIM shares the same assumptions as GLOB and can be deployed similarly. It further reduces memory requirements for embeddings to match the data source's needs ($d_{model} \times |V_k|$), also lowering communication costs. These savings are substantial for multilingual models with large vocabularies (Ushio et al., 2023); for instance, mT5 and mBART (Xue et al., 2021; Lewis et al., 2020) allocate 40%-80% of parameters to embeddings. Since our models use tied weights (Inan et al., 2017), TRIM, unlike GLOB, restricts their output space, which slightly affects perplexity.
SPEC enables pre-training across data sources without a shared vocabulary, providing TRIM's benefits plus local specialization. Communication costs are minimized by transferring only the transformer body to the outer optimizer and decoupling embeddings, enabling vocabulary-agnostic training. This makes SPEC ideal for training a transformer body with unknown or private data. To enable inference, SPEC requires a global embedding matrix. While several methods exist (Section 6.1 and Appendix F), we use the straightforward approach of multi-phase adaptive pre-training (Gururangan et al., 2020), or continued pre-training with a randomly initialized matrix. This approach follows other techniques for enhancing model capabilities, e.g., long-context pre-training stages (Devlin et al., 2019;Dubey et al., 2024) and domain adaptation (Gururangan et al., 2020).
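To make the cost expressions of Table 1 concrete, the sketch below evaluates them as parameter counts with the big-O constants dropped. The function name and all numbers are hypothetical and chosen only for illustration; they are not the paper's model sizes.

```python
def dept_costs(M, V, V_k, L, d_model, n_local):
    """Return {variant: (memory, per-step communication)} in parameters.

    M: total parameters, V: global vocabulary size, V_k: local vocabulary size,
    L: sequence length (positional embedding rows), n_local: local steps per round.
    """
    trimmed = M - (V - V_k) * d_model          # drop embedding rows outside V_k
    body_only = M - (V + L) * d_model          # drop all token/positional embeddings
    return {
        "STD": (M, M),
        "GLOB": (M, M / n_local),
        "TRIM": (trimmed, trimmed / n_local),
        "SPEC": (trimmed, body_only / n_local),
    }


# Hypothetical model: 1M-parameter body, tied token embeddings, learned positions.
d_model, V, V_k, L = 64, 10_000, 2_000, 512
M = 1_000_000 + V * d_model + L * d_model
costs = dept_costs(M, V, V_k, L, d_model, n_local=500)
for name, (mem, comm) in costs.items():
    print(f"{name}: memory={mem:,} params, per-step comm={comm:,.0f} params")
```

In this toy setting the SPEC communication term reduces to the body size divided by the number of local steps, since no embedding rows are exchanged.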

Section: EXPERIMENTAL DESIGN
We propose DEPT as an efficient alternative to standard pre-training to address the Curse of Multilinguality and Negative interference. In this section, we conduct experiments to evaluate DEPT's performance, focusing on the following research questions:
RQ1 Does DEPT allow us to increase the number of training tokens from heterogeneous data?
RQ2 Does DEPT improve efficiency, in terms of memory and communication costs?
RQ3 Does DEPT improve zero-shot generalization to out-of-distribution data?
RQ4 Does DEPT improve model plasticity when learning new distributions?

Section: EXPERIMENTAL SETUP
For our experiments, we train decoder-only transformers, currently the most relevant architectures, ranging from 125M to 1.3B parameters with 12 to 24 blocks (Tables 2 and 8). We use parameter averaging (McMahan et al., 2017a; Stich, 2019) as our OuterOpt optimizer, and AdamW (Loshchilov & Hutter, 2019) for InnerOpt. Full experimental details on our architecture, training hyperparameters (Tables 2 and 8), dataset, and baseline implementation are in Appendix A.

Section: MULTI-DOMAIN AND MULTILINGUAL METHODOLOGY
To evaluate DEPT on multi-domain data, we use The Pile (Gao et al., 2021), which includes 22 subsets. We select 16 non-copyrighted subsets as our K data sources in Algorithm 1: GitHub (GH), DeepMind Mathematics (DM), Wikipedia (WK), Common Crawl (CC), PubMed Abstracts (PA), PubMed Central (PC), USPTO Backgrounds (UB), NIH Exporter (NH), FreeLaw (FL), Enron Emails (EE), EuroParl (EP), Stack Exchange (SE), Philosophy Papers (PP), ArXiv (AX), Project Gutenberg (GU), and Hacker News (HN). Ubuntu IRC (UI) is the out-of-distribution dataset.
For multilingual data, we use MC4 (Xue et al., 2021) with a mix of high, medium, and low-resource languages: English (EN), Italian (IT), and Chinese (ZH) as high-resource; Serbian (SR) and Malay (MS) as medium-resource; and Swahili (SW), Urdu (UR), and Latin (LA) as low-resource. Following Rust et al. (2021), we train unigram SentencePiece (Kudo & Richardson, 2018) tokenizers with a vocabulary of 50,257 tokens per data source. SPEC variants with optimized per-source vocabularies have the OPT suffix; otherwise, they use a global vocabulary with specialized embeddings.

Section: BASELINES
We compare DEPT with standard pre-training methods from prior works (Conneau et al., 2020). General distributed SGD methods (Li et al., 2020; Rajbhandari et al., 2020), which synchronize gradients at each step and sample from all data sources simultaneously, are labeled as STD. For multilingual data, we apply temperature-weighted sampling (Devlin et al., 2019) with τ = 0.3, denoted as STD (τ = 0.3), as well as uniform, STD (τ = 0), and proportional, STD (τ = 1), sampling. For multi-domain data, we use uniform and proportional sampling. Since DEPT samples data sources at random (Algorithm 1), baselines with uniform sampling are closest to DEPT.
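The three baseline sampling regimes can be sketched with a single formula. The form below, with probabilities proportional to dataset size raised to the power τ, is an assumption chosen because it reproduces the stated endpoints (τ = 0 uniform, τ = 1 proportional); the dataset sizes are hypothetical.

```python
import numpy as np


def sampling_probs(sizes, tau):
    """Temperature-weighted sampling: p_i proportional to n_i ** tau (assumed form)."""
    weights = np.asarray(sizes, dtype=float) ** tau
    return weights / weights.sum()


sizes = [1_000_000, 100_000, 1_000]    # hypothetical tokens per language
uniform = sampling_probs(sizes, 0.0)   # STD (tau = 0)
tempered = sampling_probs(sizes, 0.3)  # STD (tau = 0.3)
proportional = sampling_probs(sizes, 1.0)  # STD (tau = 1)
```

Lower τ shifts probability mass toward the smallest source, which is why tempered sampling is commonly used to support low-resource languages.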
Additionally, we compare against the "pre-training with active forgetting" (ACT) method (Chen et al., 2023), which enhances plasticity and generalization by periodically randomly resetting embeddings. While Chen et al. (2023) transfer monolingual models between languages, we only utilize their pre-training phase due to our different settings. Like SPEC, ACT does not produce a fully trained embedding matrix and we employ the same multi-phase adaptive pre-training to create a new embedding matrix from a random initialization. Despite this similarity, SPEC is significantly more compute efficient than ACT, as it avoids extensive retraining of embeddings. Full details for how we implemented and adapted ACT can be found in Appendix A.1.3.

Section: METRICS
The key characteristics for multi-domain and multilingual pre-training are model generalization and plasticity. Generalization refers to the model's ability to perform well on unseen data, whether in-domain or out-of-distribution (OOD). We assess in-domain generalization by evaluating the perplexity of a model on the test set of each training data source, while OOD generalization is evaluated with unseen datasets. Furthermore, we evaluate DEPT's efficacy in building foundation models through downstream tasks: Natural Language Inference via MNLI (Williams et al., 2018), Question Answering via RACE (Lai et al., 2017), and STSB. We assess training robustness and stability using the L2 norm of model parameters and activations.
Model divergence in LLMs, as noted by the OPT (Zhang et al., 2022) and PaLM (Chowdhery et al., 2023) teams, correlates with rapid increases in activation norms, a trend also observed in vision transformers (Dehghani et al., 2023). While more common at large scales, this issue can arise in smaller transformers depending on learning rate suitability (Wortsman et al., 2024), which, like batch size, is influenced by the gradient noise scale for a given data distribution (McCandlish et al., 2018). Notably, all performance comparisons use optimized baseline hyperparameters (see Appendix A).

Section: CONTINUED PRE-TRAINING AND EVALUATION
Once pre-training is complete, some methods, including SPEC and ACT, lack a global embedding, while others, such as STANDARD pre-training, GLOB, and TRIM, include one. For ACT and SPEC (see Section 3.5), we enable a global (shared) embedding through multi-phase adaptive pretraining (Gururangan et al., 2020). This involves broad DEPT pre-training (Algorithm 1) followed by continued pre-training on another 15-19% of the total steps on a non-private dataset using a randomly initialized embedding matrix with a global vocabulary tailored to the specific corpus. For this phase, we use the tokenizer of Black et al. (2022) for English data and Xue et al. (2021) for multilingual data. These extra steps are applied to all models for fair comparison. While random initialization reveals the quality of the transformer body for all DEPT variants, we are also concerned with the independent effectiveness of GLOB and TRIM in building high-quality global embeddings compared to STANDARD methods. We perform the same 15-19% extra steps for this comparison, starting from pre-trained embeddings.
Unlike pre-training, this stage requires a sampling strategy. Since The Pile is curated for proportional sampling (Gao et al., 2021), we use it for multi-domain continued pre-training, while uniform sampling is applied to multilingual data to support low-resource languages.

Section: RESULTS
Our results show that DEPT improves transformer body generalization (Tables 3 and 4), enhancing robustness (Fig. 2), plasticity (Fig. 3), and downstream performance (Table 7) while reducing communication and memory costs (Table 2).

Section: DEPT IS ROBUST TO DATA HETEROGENEITY (RQ1)
Our experiments demonstrate DEPT's robustness to multilingual and multi-domain data heterogeneity. As shown in Fig. 2, DEPT resists activation divergence and model norm increases, which can halt perplexity improvements or cause divergence (Zhang et al., 2022; Chowdhery et al., 2023; Wortsman et al., 2024). When using the same local hyperparameters as the baselines, models trained with all DEPT variants maintain lower activation norms due to the regularization effects of OuterOpt (Algorithm 1). Learning rates for baselines are reduced for later comparisons to ensure convergence.
Figure 2: Activation and model norms for a 350M model trained with identical local hyperparameters, prior to adjusting STD (τ = 0) and STD (τ = 1) (uniform and proportional sampling) to a lower learning rate. The OuterOpt of DEPT introduces regularization effects due to noise-injection (Lin et al., 2020) and meta-learning (Nichol et al., 2018) characteristics, which constrain these sources (Zhang et al., 2022) of model divergence.

Section: DEPT IMPROVES TRAINING EFFICIENCY (RQ2)
Tables 1 and 2 show that DEPT significantly reduces average GPU memory and per-step communication costs compared to DDP. The 500× communication cost reduction from GLOB matches that of Local SGD, as it synchronizes gradients only every $N_{local}$ steps, allowing GPUs to operate independently in between. TRIM further improves memory and communication costs by reducing vocabulary size, shrinking the global embedding matrix by 8% to 32% for multilingual data and by 2% to 78% for The Pile, with the largest reduction (78%) achieved for the mathematics subset (see Appendix A.2 for precise vocabulary sizes). SPEC eliminates embedding-related communication, reducing costs by an additional 13% to 30% for multi-domain data and 34% for multilingual data. Finally, DEPT enables efficient training of billion-scale models (Fig. 4) on multilingual data, achieving a 714× reduction in communication costs (Table 2) and a 24% reduction in memory costs.

Section: DEPT IMPROVES ZERO-SHOT GENERALIZATION (RQ3)
We show that DEPT variants significantly enhance transformer body generalization, outperforming STANDARD pre-training and active-forgetting (ACT) in: (a) perplexity on pre-training validation data, (b) perplexity on OOD validation data, and (c) downstream fine-tuning on MNLI, RACE, STSB. As detailed in Section 3.5, DEPT serves as the first stage of a multi-phase adaptive pre-training pipeline, followed by continued pre-training on a non-private dataset, with pre-training data coalesced as in STANDARD training. Our results reflect performance after this phase is applied to baselines as well, ensuring embeddings process the same number of tokens. To gauge tokenizer effectiveness on a dataset, we report the unigram cross-entropy (UNIGRAM-CE) of the unigram model defined by the token frequencies, with higher values indicating a harder-to-model distribution (Tao et al., 2024) (see Appendix A.2.1). Overall, DEPT variants win 82.2% (51/62) of our main comparisons across The Pile, MC4 and downstream tasks, producing generalizable and performant transformer bodies.
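As an illustration of the UNIGRAM-CE metric described above, the sketch below scores a token sequence under the unigram model given by its own token frequencies. The use of natural logarithms (nats) and evaluation on the same data that defines the frequencies are assumptions; Appendix A.2.1 of the paper holds the exact definition.

```python
import math
from collections import Counter


def unigram_cross_entropy(token_ids):
    """Cross-entropy (in nats) of the frequency-based unigram model on token_ids."""
    counts = Counter(token_ids)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())


flat = unigram_cross_entropy([0, 1, 2, 3] * 25)      # uniform over 4 tokens
peaked = unigram_cross_entropy([0] * 97 + [1, 2, 3]) # dominated by one token
```

A flatter token distribution yields a higher value, matching the reading that high UNIGRAM-CE sources are harder to model.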

Section: TRANSFORMER BODY GENERALIZATION
Table 3: Validation perplexity (↓) for 24-block models trained on The Pile after continued pre-training with proportional sampling from randomly-initialized embeddings. DEPT improves performance across all data sources, outperforming baselines by 15.3% on average. SPEC-OPT, using an optimized vocabulary, outperforms GLOB on high UNIGRAM-CE sources.
Tables 3 and 4 present results where embedding matrices are initialized randomly. DEPT variants significantly outperform all baselines across validation sets for multilingual and multi-domain data sources, including high- and low-resource subsets. Min and max improvements, shown in the last two rows of the tables, compare the worst and best DEPT variants to the best-performing baseline.
The best DEPT variant achieves an average performance improvement of 17.3% on MC4 and 15.3% on The Pile, while even the worst variant shows improvements of 14.4% and 9.7%, respectively.
DEPT wins 100% of comparisons for both The Pile (17/17) and MC4 (11/11). For OOD data, DEPT variants outperform by 10-20% on average for MC4 and 1.5-10.5% on The Pile, despite the high UNIGRAM-CE of OOD data, which makes it more difficult. This demonstrates that DEPT produces superior transformer bodies with better generalization. Notably, TRIM performs comparably to GLOB despite significant reductions in parameter counts and communication costs during pre-training, suggesting that out-of-vocabulary mistakes do not drastically impact performance. For downstream tasks, however, TRIM surpasses GLOB (Table 7). SPEC performs similarly to GLOB and TRIM, even without sharing token embeddings across data sources. The SPEC-OPT variant, trained with unique vocabularies and parameters for each The Pile data source, outperforms GLOB on datasets with high UNIGRAM-CE or those dissimilar to natural language, such as multilingual EP, math-heavy DM, code-based GH, and the high-UNIGRAM-CE dataset UI. For MC4, SPEC consistently outperforms on OOD datasets with high UNIGRAM-CE. These results hold across model sizes (see Table 12) and across sampling techniques (Table 10).

Section: PRE-TRAINED EMBEDDING GENERALIZATION
Tables 5 and 6 represent cases where the global embedding is initialized using the final global embedding obtained during pre-training, applicable only to the GLOB and TRIM variants. For The Pile (Table 5), both variants outperform their standard pre-training counterparts, achieving a 5.5% improvement in average accuracy and winning 12/17 comparisons. Two of the lost comparisons, the small subsets EN and EP, are instead won when using uniform sampling (Table 11).
Table 5: Validation perplexity (↓) for 24-block models trained on The Pile with continued pre-training using proportional sampling from pre-trained embeddings. DEPT wins 70% (12/17) of comparisons, with GLOB consistently outperforming TRIM. In Table 3, DEPT wins the remaining 5 due to its superior transformer body. Likewise, the EN and EP comparisons are won when using uniform sampling (Table 11) as embeddings become more refined on these smaller datasets.
Min Imp (%) -3 -48.7 -46.9 -3.9 -3.5 -2.7 -2.7 -3.4 -30.1 -2.6 -2.9 -2.7 -2.6 -9.6 -4.
Similarly to The Pile, the other comparisons are all won when starting from random embeddings. Thus, while DEPT may benefit the transformer body, care must be taken to design an appropriate continued pre-training pipeline to effectively fine-tune the embeddings.

Section: DOWNSTREAM GENERALIZATION
Table 7 presents the downstream performance of 24-block DEPT models pre-trained and continued pre-trained (with uniform sampling) on The Pile. DEPT models consistently outperform the baselines, regardless of initialization, with TRIM achieving the best results and SPEC matching GLOB in wins. Despite occasional losses to GLOB in language modeling, we speculate that the restricted vocabulary of TRIM forces it to adapt to language shifts, improving generalization, akin to ACT's re-initialization but more effective. While ACT performs better on downstream tasks than on language modeling (Chen et al., 2023), it is outperformed by DEPT. DEPT leverages inherent aggregation noise to develop robust parameters without artificial re-initialization, ensuring that parameter updates are not discarded and avoiding the waste of compute cycles.

Section: DEPT IMPROVES MODEL PLASTICITY (RQ4)
Finally, we investigate how plastic DEPT models are in adapting to either a new data source or to the most heterogeneous subset of the pre-training set. Figure 3 shows the perplexity adaptation plots when starting from a random initialization on the full pre-training set (serving as a baseline), the data source with the smallest vocabulary (SW), or new languages (HI, DE). DEPT variants are always the fastest to adapt to each data source and provide the lowest final perplexity; for the full pre-training set, we use perplexity taken over all language validation sets.
Figure 3: DEPT variants are always stable in their convergence, reaching the lowest perplexity for the full dataset and the out-of-distribution language (HI). They are also always the fastest to adapt; full results are available in Figure 5.

Section: RELATED WORK
Large language models (LLMs) exhibit cross-lingual alignment due to "incidental bilingualism" (Briakou et al., 2023) and cross-lingual data sharing (Choenni et al., 2023). Expanding multilingual data during pre-training can enhance language diversity (Scao et al., 2022) but often results in uneven performance due to data imbalance and low-resource degradation (Ding et al., 2024; Lai et al., 2023). Supervised parallel data (e.g., XLM (Conneau & Lample, 2019), PaLM2 (Anil et al., 2023)), Knowledge Transfer (Zhang et al., 2023; Wang et al., 2023), and Domain Adaptation (Huang et al., 2024) face challenges in low-resource settings (Chang et al., 2023b; Li et al., 2024), with risks like training instability and catastrophic forgetting (Kirkpatrick et al., 2017). This motivates our novel pipeline, focusing on language heterogeneity, generalization, and plasticity.
Vocabulary construction is crucial in multilingual pre-training. Techniques include tokenization with a temperature setting (Devlin et al., 2019) and language-clustered vocabularies (Chung et al., 2020), though the latter requires predefined clusters.
Active forgetting (Chen et al., 2023), a related approach, enhances model plasticity by periodically re-initializing embeddings, easing adaptation to new languages.

Section: CONCLUSION
We investigated pre-training Language Models (LMs) under data heterogeneity, proposing an efficient and robust pipeline, DEPT, which supports training under diverse data sources while mitigating Negative Interference and the Curse of Multilinguality. The core of DEPT is decoupling the embedding space from the transformer body during pre-training, offered in three variants with varying degrees of separation. Experiments showed that DEPT (1) allows training across heterogeneous data efficiently, (2) reduces the memory footprint of token embedding matrices by 4-5×, (3) improves model generalization and plasticity with lower perplexity on validation and out-of-distribution test datasets, and (4) supports custom vocabularies per data source, enabling vocabulary-agnostic federated pre-training, which we have tested up to billion-scale models and intend to push further.

Section: LIMITATIONS & FUTURE WORK
DEPT offers a pre-training framework intended to precede further adaptation or fine-tuning. However, DEPT models require a final global embedding for practical use. The GLOB and TRIM variants provide this at the end of pre-training, while SPEC does not, suggesting future work on embedding generation methods, such as zero-shot embedding transfer (Mosin et al., 2023), vocabulary matching (Xu et al., 2024) and model stitching (Moschella et al., 2023).

Section: A EXPERIMENTAL DETAILS
Section: A.1 MODEL ARCHITECTURES AND HYPERPARAMETERS
Table 8 presents the vocabulary-agnostic hyperparameters of our decoder-only models, while Table 9 details vocabulary sizes, DEPT-specific parameters, memory costs, and communication costs. Standard pre-training pipeline parameters were chosen based on the recommendations of Hoffmann et al. (2022) and MosaicML, except for the billion-scale model, where we aligned with the recent state-of-the-art (SOTA) for English federated pre-training by Sani et al. (2024). We always use a gradient clipping norm of 1 and ALiBi (Press et al., 2022) positional embeddings.
During continued pre-training, for models initialized randomly, we begin with $\eta_{max}$ and decay over $N_{CT}$ learning steps, allowing quick embedding matrix learning without requiring another full training pass, as is common in language rewiring (Artetxe et al., 2020). When using pre-initialized models, we start from $\eta_{max}/2$ since both the model and embeddings are reasonably well-trained.
Importantly, the only parameter changed between DEPT models and baselines is the learning rate $\eta_{max}$. We use the same learning rate to contrast convergence properties for comparisons in Fig. 2. We tune the baselines' learning rate for later comparisons to ensure they perform the same number of training steps, selecting the best checkpoint for a baseline across all experiments. Except for tuning the learning rate, DEPT models always use the same hyperparameters as the baselines during local training.
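The continued pre-training schedule described above can be sketched as a cosine decay. This is a minimal illustration, not the MosaicML scheduler: the absence of warmup, the interpretation of α as the final fraction of the peak learning rate, and the step count used in the example are assumptions.

```python
import math


def cosine_lr(step, eta_max, n_steps, alpha=0.1):
    """Cosine decay from eta_max at step 0 to alpha * eta_max at n_steps."""
    progress = min(step, n_steps) / n_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return eta_max * (alpha + (1.0 - alpha) * cosine)


eta_max, n_ct = 2e-4, 10_000   # hypothetical peak rate and continued pre-training steps

# Randomly initialized embeddings: start the decay from eta_max.
random_init = [cosine_lr(s, eta_max, n_ct) for s in range(0, n_ct + 1, 2_500)]
# Pre-initialized embeddings: start the decay from eta_max / 2.
pre_init = [cosine_lr(s, eta_max / 2, n_ct) for s in range(0, n_ct + 1, 2_500)]
```

Starting from half the peak rate for pre-initialized models keeps the shape of the decay identical while limiting how far already-trained embeddings can drift.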
Table 8: Architectural details and vocabulary-independent hyperparameters of our models. The number of transformer blocks is denoted by #Blocks, the number of attention heads by #Heads, and the expansion ratio refers to the ratio of the feedforward hidden dimension to the model embedding dimension. The total number of model parameters is $M$, the vocabulary size is $|V|$, and the model embedding dimension is $d_{model}$. We train standard decoder-only transformers whose body ranges in size from 86.4M to 1.2B independent of embeddings. As we see in Table 9, the size of the embedding matrix can change the model size drastically. Our batch size is $|B|$ while $|S_t|/|S|$ is our sampling ratio for the various data sources. The $\beta_1, \beta_2$ pair are AdamW parameters while the $S_c$ tuple represents the parameters of the cosine scheduler that we use, including the decay alpha $\alpha$, the maximum learning rate $\eta_{max}$, and the total number of sequential steps $N$. Finally, we show the number of continued pre-training steps $N_{CT}$ that we use, representing 15% of total steps for the 298M model and 19.3% for the 86.4M model. All of our models use a sequence length of 2048. We followed the hyperparameters of Sani et al. (2024) for the billion-scale federated pre-training. We report the tuned $\eta_{max}$ for each baseline according to Appendix A.1.2: $\eta_{max}^{STD(\tau=0)}$, $\eta_{max}^{STD(\tau=0.3)}$, $\eta_{max}^{STD(\tau=1)}$. We find that the embedding resetting allows ACT to use the same $\eta_{max}$ as DEPT.
$(10^{-1}, 2 \times 10^{-4}, 7 \times 10^{4})$ - $1 \times 10^{-4}$ $1 \times 10^{-4}$ $1.5 \times 10^{-4}$
We had to select a particular sampling ratio for the continued pre-training using the full pre-training set rather than a single language or domain. Due to its high heterogeneity, we default to uniform sampling for MC4 in these cases. In contrast, for The Pile, we preferred proportional sampling as the dataset is entirely in English and has already had its data sources upsampled/downsampled based on usefulness. We also provide results using the alternative sampling policy in Appendix B.

Section: A.1.1 SOFTWARE AND HARDWARE
Our software is based on the MosaicML composer (Databricks, 2024) library for LLM pre-training and the open-source Flower (Beutel et al., 2022) framework for federated learning. Crucially, we heavily rely on the MosaicML hyperparameters and infrastructure for our InnerOPT, making no changes to it after our embedding-matrix manipulation from Algorithm 1 has been performed. For the standard baselines, we ran them on a completely unmodified version of the MosaicML codebase (beyond using our data), which has been independently verified by thousands of users and used to submit accepted conference publications (Blakeney et al., 2024). In terms of hardware, the low communication properties of DEPT allowed us to run experiments via a mixture of loaned resources from separate cloud providers. Over the course of our experimentation, we used various machines equipped with either 1 H100 or 1 A100 GPU in the USA, Canada, and Europe, which turned out to be more cost-effective. We rented machines with 4-8 H100 GPUs for the centralized baselines since we could not use Distributed Data Parallelism techniques over low-bandwidth internet connections. When the standard training baseline has a sufficiently low learning rate to converge, the difference in training time is driven by three factors.
First, the throughput achieved by individual workers: for GLOB, this should be identical to standard pre-training as the model in memory remains unchanged. For TRIM and SPEC, the reduced memory requirements may allow increasing the device micro-batch size in certain scenarios (but not the global batch size, which heavily influences optimization properties). This depends heavily on the hardware; for example, in DeepMind Mathematics workloads, TRIM or SPEC can double the device micro-batch size, and similarly for SPEC-OPT in the case of multilingual data.
Second, the communication topology significantly impacts wall-clock time. For instance, in a 10 Gbps bandwidth connection using Ring AllReduce for aggregation across workers, DEPT can reduce training time by 33% for a 1-billion-parameter model. In cases with a very fast connection, such as InfiniBand, the training time difference is primarily determined by throughput differences.
Third, the number of local data sources and the number of available workers impact the total training time; for DEPT, we always scale the number of workers to match the number of data sources exactly.
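To make the second factor more concrete, the sketch below estimates the communication time of one Ring AllReduce using the standard bandwidth-only cost model, in which each worker transfers 2(N-1)/N times the payload. The worker count, the 16-bit parameter encoding, and the total step count are illustrative assumptions; the sketch is not meant to reproduce the 33% figure quoted above.

```python
def ring_allreduce_seconds(num_params, num_workers, bandwidth_gbps, bytes_per_param=2):
    """Bandwidth-only estimate of one Ring AllReduce over the model parameters.

    Each worker sends and receives 2 * (N - 1) / N times the payload;
    latency and compute/communication overlap are ignored.
    """
    payload_bits = num_params * bytes_per_param * 8
    traffic_bits = 2 * (num_workers - 1) / num_workers * payload_bits
    return traffic_bits / (bandwidth_gbps * 1e9)


# Assumed setting: 1B parameters, 8 workers, 10 Gbps links, 16-bit values.
per_sync = ring_allreduce_seconds(1_000_000_000, num_workers=8, bandwidth_gbps=10)

# Data-parallel training synchronizes every step, whereas a worker that
# trains locally synchronizes once every n_local steps (500 in A.1.3).
steps, n_local = 10_000, 500  # the total step count is an assumption
every_step_comm = steps * per_sync
periodic_comm = (steps // n_local) * per_sync

print(f"one synchronization: {per_sync:.2f} s")
print(f"every-step sync: {every_step_comm / 3600:.2f} h")
print(f"periodic sync:   {periodic_comm / 60:.2f} min")
```

Under this simplified model, the communication volume shrinks by the number of local steps between synchronizations, which is why slow links hurt per-step synchronization far more than periodic averaging.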

Section: A.1.2 HYPERPARAMETER TUNING METHODOLOGY
Given that MosaicML provides hyperparameter-tuned models on the C4 (Raffel et al., 2020) dataset, we use their learning rate schedule and number of training steps as a starting point. In the case of DEPT, we find that we can always use the MosaicML parameters since the OuterOpt application of DEPT acts as a regulariser via noise-injection (Lin et al., 2020) and meta-learning effects (Nichol et al., 2018). This makes DEPT models highly unlikely to diverge, even under extreme data heterogeneity and without a shared input or output space. In the case of standard training baselines, we gradually lower the learning rate, starting from the one reported in Table 8.
We begin with the maximum learning rate $\eta_{\max}$ and systematically reduce it on a coarse grid in intervals of $0.5 \times 10^{-5}$:
$$\eta = \eta_{\max} - 0.5k \times 10^{-5}, \quad k \in \{0, 1, 2, \dots, K\},$$
where k represents the step index, and K is chosen such that η > 0 at the final step. Given that the length of the cosine cycle is directly extrapolated from known scaling laws on the number of tokens that the model needs to train on for compute-optimality (Hoffmann et al., 2022), approximately 20 tokens per parameter, we stop as early as we find a learning rate that can complete the entire cosine schedule. Then, we choose the best-performing checkpoint, according to validation perplexity, across all experiments. We report these values in Table 8.
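The search procedure above can be sketched as follows. The helper names and the `completes_schedule` callback are hypothetical: the callback stands in for a full training run and reports whether the cosine schedule finished without divergence.

```python
def learning_rate_grid(eta_max, step=0.5e-5):
    """Return eta_max - k * step for k = 0, 1, ... while the value stays positive."""
    grid = []
    k = 0
    while eta_max - k * step > 1e-12:  # tolerance guards against float round-off
        grid.append(eta_max - k * step)
        k += 1
    return grid


def first_stable_learning_rate(eta_max, completes_schedule):
    """Walk the grid from the largest rate and stop at the first one that
    completes the schedule. Returns None if no rate on the grid succeeds."""
    for eta in learning_rate_grid(eta_max):
        if completes_schedule(eta):
            return eta
    return None


# Toy stand-in for real training: pretend every run above 1.2e-4 diverges.
chosen = first_stable_learning_rate(2e-4, lambda eta: eta <= 1.2e-4 + 1e-12)
print(chosen)
```

Stopping at the first stable rate keeps the number of full training runs small, at the cost of not exploring lower learning rates that might generalize better.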
This hyperparameter search does not cover all possible relevant parameters; given enough resources, we would also tune the gradient clipping norm. Furthermore, we could tune the batch size using the empirical model of large-batch training proposed by McCandlish et al. (2018). Given that the appropriate learning rate depends on the chosen batch size and the desired target loss, such an optimization would require hundreds of experiments across all baselines to find an optimal configuration.

Section: A.1.3 ADAPTING ACTIVE FORGETTING
To implement the active forgetting baseline (Chen et al., 2023), ACT, we had to adapt the methodology to decoder-only models, which train with far fewer steps. To achieve this, we use a forgetting frequency of 500 steps, equal to DEPT's N_local. We also use a cosine scheduler for the body with the same parameters as shown in Table 8; however, we schedule the embedding matrix independently across the 500 steps using the same scheduler but setting η′_max = 500. Finally, we selected the checkpoint with the lowest validation perplexity for continued pre-training in a forgetting cycle.
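A minimal sketch of this two-schedule arrangement is given below, assuming a plain cosine decay without warmup. The function names and the peak learning rates in the usage note are illustrative assumptions rather than the values used in our experiments.

```python
import math


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Plain cosine decay from peak_lr to min_lr over total_steps (no warmup)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def act_learning_rates(step, total_steps, body_peak_lr, emb_peak_lr, forget_every=500):
    """Return (body_lr, embedding_lr, reset_now) for a given global step.

    The body follows one cosine cycle over the whole run, while the embedding
    schedule restarts at every forgetting boundary, which is also where the
    embedding matrix would be re-initialized.
    """
    reset_now = step > 0 and step % forget_every == 0
    body_lr = cosine_lr(step, total_steps, body_peak_lr)
    emb_lr = cosine_lr(step % forget_every, forget_every, emb_peak_lr)
    return body_lr, emb_lr, reset_now
```

For example, `act_learning_rates(500, 10_000, 2e-4, 1e-3)` flags a reset and returns the embedding schedule to its peak, while the body learning rate continues along its single cosine cycle.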

Section: A.2 DATA SOURCES
We quantify the lexical heterogeneity of a dataset based on lexical similarity between data sources. A simple similarity measure is the size of the intersection of subwords between vocabularies. The smaller the intersection, the more dissimilar the vocabularies, and thus, the more challenging it becomes to train a shared tokenizer effectively across different domains or languages. For this section, we use the size of local vocabulary as a subset of the global vocabulary as a proxy, with smaller local vocabulary indicating that global tokenization does not serve a particular data source well.
Our default global tokenizer for multilingual data is that proposed by Xue et al. (2021), with V = 250 112 tokens. Owing to its diverse pre-training, the mT5 (Xue et al., 2021) tokenizer is a robust default choice, employed in recent works such as project Aya (Üstün et al., 2024). However, its coverage of hundreds of languages does come with many shortcomings relating to the capacity allocated to each language. To showcase these challenges, we carefully selected languages from distinct families in the MC4 subset, including English (EN), Italian (IT), Serbian (SR), Swahili (SW), Urdu (UR), Latin (LA), Chinese (ZH), and Malay (MS). The corresponding vocabulary sizes of our languages are as follows: {247 720, 211 332, 208 391, 170 984, 188 002, 220 757, 240 566}. Among these, Swahili (SW) is the most heterogeneous, as determined by its small subset of 170 984 tokens.
Our global tokenizer for English data was trained on The Pile (Gao et al., 2021) and proposed by Black et al. (2022) with V = 50 257 tokens. We selected The Pile as our multidomain dataset for several reasons. The Pile is a diverse, large-scale dataset specifically designed for training large language models (LLMs). Its diversity spans domains such as scientific papers, news, books, and web content, providing a comprehensive foundation for capturing varied linguistic patterns. Among the various subsets of The Pile, DM Mathematics stands out as the most heterogeneous. This subset contains only 11 090 tokens from the global vocabulary, significantly fewer than other subsets. Here are the sizes of other subsets in terms of their unique tokens from the global vocabulary: {49 362, 49 783, 46 766, 49 469, 49 700, 47 865, 48 720} {11 090, 44 249, 42 957, 44 432, 49 992, 49 841, 47 687, 49 961, 46 825}. While this indicates much lower heterogeneity than in multilingual settings, vocabulary choice may still impact highly specialized model capabilities such as mathematical reasoning.
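The local-vocabulary proxy used in this section can be illustrated with a small sketch. The whitespace tokenizer and the two toy data sources are stand-ins chosen for illustration; in practice, `tokenize` would be the global subword tokenizer and each source a full dataset.

```python
def local_vocabulary(tokenize, documents):
    """Set of global-vocabulary tokens that actually occur in one data source."""
    used = set()
    for doc in documents:
        used.update(tokenize(doc))
    return used


def vocabulary_overlap(vocab_a, vocab_b):
    """Size of the intersection between two local vocabularies."""
    return len(vocab_a & vocab_b)


# Toy stand-in for a real subword tokenizer: whitespace splitting.
tokenize = str.split
sources = {
    "math": ["2 + 2 = 4", "3 * 3 = 9"],
    "web": ["the cat sat", "the dog sat = fact"],
}
local = {name: local_vocabulary(tokenize, docs) for name, docs in sources.items()}
sizes = {name: len(vocab) for name, vocab in local.items()}
shared = vocabulary_overlap(local["math"], local["web"])
print(sizes, shared)
```

A source whose local vocabulary is small relative to the global one, or whose overlap with the other sources is small, is the kind of source we describe as lexically heterogeneous.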

Section: A.2.1 TOKENIZATION CONSIDERATIONS
One of the major challenges when representing multiple data sources with a single tokenizer is vocabulary dilution. To maximize coverage, a tokenizer that aims to cover multiple languages or domains often needs to adopt many short subwords. This increases the tokenizer fertility (i.e., the number of tokens produced per unit of text) (Rust et al., 2021) and also raises the overall description length, i.e., the total number of tokens required to represent the same data. This trade-off negatively affects the compression ratio, as the same amount of information requires more tokens, reducing the model's sample efficiency (Tao et al., 2024). When non-uniform sampling ratios are used during pre-training, high-resource languages tend to have better fertility than low-resource languages. This means high-resource languages are better represented in the vocabulary, and their tokens are more likely to be shared across the model's parameters, improving their performance. In contrast, low-resource languages suffer from poor fertility, where their unique vocabulary tokens are underrepresented, leading to worse performance. For example, Swahili (SW) and Urdu (UR) are low-resource languages that face these challenges. Our SPEC method allows us to avoid many of these issues by providing an optimized vocabulary to a data source at the cost of losing a shared vocabulary and updating several embedding matrices. An alternative approach is to cluster vocabularies (Chung et al., 2020) to obtain subword sharing between more relevant languages. However, this requires that participating data sources are known in advance, do not change significantly, and that the appropriate number of clusters is also known.
To account for the effectiveness of a tokenizer on a given language, we report unigram cross-entropy in our experiments, which represents how effective a simple unigram model based on the tokenizer is on that data source as a proxy for the effectiveness of the tokenization. If the unigram cross-entropy is high on a given data source, it is likely underserved by the tokenization. Thus, all improvements brought about by using a more complex language model must consider this baseline. It can also be used to compute unigram-normalized cross-entropy or perplexity, a language modeling performance metric that is comparable across different vocabulary sizes (Tao et al., 2024).
For the initial rounds, we sample 4 data sources out of 8; after seeing most of the clients, we reduce the number to 2. We make sure only to introduce EN later into the experiment.
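The unigram baseline discussed in this section can be sketched as follows. The add-one smoothing and the function names are assumptions made for illustration, and the normalization follows one common formulation in which the model's cross-entropy is offset by the unigram cross-entropy before exponentiation.

```python
import math
from collections import Counter


def unigram_cross_entropy(train_tokens, eval_tokens):
    """Average negative log-probability (nats per token) of eval_tokens under a
    unigram model estimated from train_tokens with add-one smoothing."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(eval_tokens)
    total = sum(counts.values()) + len(vocab)
    nll = -sum(math.log((counts[t] + 1) / total) for t in eval_tokens)
    return nll / len(eval_tokens)


def unigram_normalized_perplexity(model_cross_entropy, unigram_ce):
    """exp(model CE - unigram CE); values below 1 mean the model improves on
    the unigram baseline for that tokenization."""
    return math.exp(model_cross_entropy - unigram_ce)
```

Because both terms are measured on the same tokenization, the difference removes the part of the loss that is explained by token frequencies alone, which is what makes the metric comparable across vocabularies of different sizes.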

Section: B ADDITIONAL RESULTS
Figure 4 provides further insights into the performance of DEPT on a larger-scale experiment with a 1.3 billion-parameter model. In this setting, the model is trained in a vocabulary-agnostic, federated fashion with dynamic client subsampling. During the initial rounds, 4 out of 8 data sources are sampled, which is reduced to 2 after most clients have been processed. Importantly, EN is introduced later in the training process to evaluate the model's cross-lingual transfer capabilities to this high-resource language. The plot illustrates that the transformer body, enabled by DEPT, effectively transfers knowledge across languages and domains, allowing newly introduced or previously stale data sources to converge to perplexity levels similar to their peers within one or two sampling rounds. This experiment underscores the feasibility and scalability of using DEPT for collaborative large-scale language model pre-training, even under extreme client subsampling and without prior knowledge of the underlying data distribution.
The results presented in Figure 5 demonstrate the robustness and adaptability of DEPT across various settings, completing the plot shown in Fig. 3. Specifically, DEPT consistently achieves the lowest perplexity across all scenarios: (1) the full pre-training distribution (MC4-FULL), (2) the lowest-resource language within the dataset (SW), and (3) two out-of-distribution languages (HI and DE). Furthermore, DEPT is not only effective in reaching convergence but also does so at a faster rate compared to other approaches. These results showcase its utility in a wide range of multilingual and domain-adaptive pre-training tasks; for example, if a new client were to be introduced in a federated setting, they show that the DEPT-trained model could quickly adapt to its data distribution. Alternatively, multi-phase adaptive pre-training represents a distinct advantage in terms of data efficiency.

Section: B.2 PLASTICITY


Section: B.3 IID DATA PERFORMANCE
In the case of IID data (represented by a random sharding of the C4 dataset), Fig. 6 shows that DEPT performs similarly to standard pre-training with the benefit of lower activation norms, indicating the potential for longer training.

Section: B.5 TRANSFORMER BODY GENERALIZATION
Table 10 shows the performance of DEPT on The Pile dataset with a 24-block model trained from randomly initialized embeddings. Here, DEPT outperforms all baselines across all subsets, with average improvements of 17.5%.

Section: B.6 PRE-TRAINED EMBEDDING MATRIX GENERALIZATION
Table 11: Validation perplexity (↓) for our 24-block models trained on The Pile when performing continued pre-training with uniform sampling starting from a pre-trained embedding matrix. DEPT wins 10 out of 17 comparisons with TRIM always outperforming GLOB. When comparing against Tables 5 and 10, we can observe that DEPT wins the complementary comparisons when starting from random embeddings or when using proportional sampling with the pre-trained embedding matrices. This indicates that baselines always have a worse transformer body, with sampling ratios heavily impacting the effectiveness of embeddings for a given dataset.
When continuing pre-training with pre-trained embedding matrices, as shown in Table 11, DEPT secures 10 out of 17 wins, with TRIM consistently outperforming GLOB. A comparison with Tables 10 and 5 reveals that DEPT also outperforms in other scenarios, whether starting from random embeddings or leveraging pre-trained ones. This consistency underscores the robustness of DEPT's transformer body across varying embedding initialization and sampling strategies.
Table 13: Validation perplexity (↓) for our 12-block models trained on The Pile when performing continued pre-training starting from a pre-trained embedding matrix. DEPT performs worse than for the 24-block models trained on The Pile and than for our MC4 models. However, when considering Table 12, we can observe it wins all comparisons by wide margins when starting from a randomly initialized embedding matrix, indicating that this gap is driven by the embedding space being fitted to the high-resource languages despite the baselines having a worse transformer body.
Min Imp (%) 5.4 -7.0 -1.2 2.0 -11.2 -10.5 -6.7 -8.3 -6.8 -5.1 -13.8 3.0 -5.5 -14.3 -5.6 0.9 -32.7 -3.7
Max Imp (%) 7.5 -4.0 1.0 4.3 -7.9 -7.5 -2.9 -6.2 -4.7 -2.6 -11.4 5.6 -4.2 -11.5 -2.3 3.8 -18.6 0.3

Section: B.7 SCALING EXPERIMENTS
Here, we train smaller multi-domain models with 12 blocks to validate the scaling properties of DEPT across model sizes. In Table 12, we observe that, similar to Table 10, DEPT models outperform all baselines significantly when starting from random initialization. Importantly, the embeddings constitute a larger percentage of the model parameters at this model scale. This highlights the robustness of DEPT's modifications to the embedding space in enabling the training of a better transformer body.
Interestingly, when using pre-trained embeddings (Table 6), the smaller DEPT models perform worse than their larger counterparts. We speculate that the amount of local per-source training performed by DEPT prior to OuterOpt should scale with model size. At this scale, the aggregation procedure may be overly harsh on the embedding parameters, particularly for the GLOB and TRIM configurations. This suggests that careful adjustments to the aggregation procedure may be necessary to maintain DEPT's effectiveness at smaller model scales.

Section: B.8 COMPARISON AGAINST SINGLE-CLIENT MODELS
To study the impact of model averaging, we now compare DEPT-based models with models trained on isolated data sources that are never averaged/merged. For fair comparisons, the model of each data source has seen as many tokens as it would have as a component in DEPT-based training and has undergone continued pre-training for the same number of steps with access to the full dataset.
Additionally, DEPT models have undergone continued pre-training from random initialization. If we had compared against such models without the continued pre-training phase, they would have dominated on their respective data source while losing all other comparisons, especially in the case of multilingual data.
Tables 14 and 15 show how DEPT models perform when all participants start from random initialization. Since the models trained on isolated data sources do not get to keep their highly specialized embeddings, this comparison evaluates how generalizable the abstractions learnt by the transformer body are across datasets. In the case of The Pile, shown in Table 14, DEPT models outperform the isolated baselines by 7.8% in terms of average perplexity. Crucially, DEPT models win all comparisons even though isolated baselines get evaluated on their pre-training dataset, indicating that they have not learned superior abstractions even in this case. For MC4, Table 15 shows a very similar trend with a much higher degree of outperformance for DEPT, 9.8% on average on in-distribution data and 16.7% for out-of-distribution (OOD) data, likely because the transformer body learned for one language has significant difficulty in adapting to a multilingual context.
Table 14: Validation perplexity (↓) for 24-block models trained on The Pile after continued pretraining with proportional sampling from randomly-initialized embeddings, compared to models which had been pre-trained on a single data source for the same total number of tokens as DEPT has seen from their distributions. DEPT outperforms all baselines. Baselines whose pre-training dataset matches the evaluation dataset are highlighted in olive.
Tables 16 and 17 show the impact of keeping the pre-trained embeddings before continued pretraining. The impact of this change is as expected: embeddings pre-trained on a specific dataset perform well on that dataset; however, they fail to generalize. In the case of The Pile, shown in Table 16, DEPT loses most comparisons to the baseline trained on a given dataset; however, it outperforms in terms of average perplexity by a remarkable 27%. For MC4, shown in Table 17, the outperformance in terms of average perplexity is even more significant, 30.6% for in-distribution data and 14.9% for OOD data.
Table 16: Validation perplexity (↓) for 24-block models trained on The Pile after continued pretraining with proportional sampling from pre-trained embeddings, compared to models which had been pre-trained on a single data source for the same total number of tokens as DEPT has seen from their distributions. DEPT significantly outperforms in terms of average perplexity but gets beaten by specialized models on their respective data source. Baselines whose pre-training dataset matches the evaluation dataset are highlighted in olive.
The experiments in our work are designed to investigate the outlined research questions instead of producing a state-of-the-art model. However, we believe that providing a comparison against a standard baseline may help better contextualize the performance of a given DEPT model. For this purpose, we chose Pythia (Biderman et al., 2023) as it shares a very similar architecture to the one used in our work, with the only exception being that Pythia uses untied weights for the embedding matrix and thus all its equivalent sizes have more model parameters. Pythia models are trained on one epoch of the entire The Pile, 300B tokens, regardless of size. Thus, we do not have any OOD dataset for them, and they are expected to perform better on Ubuntu IRC. Since they are trained for many more tokens than DEPT models, we do not perform additional continued pre-training when starting from a pre-trained embedding matrix (the extra tokens existed to equalize the amount of work done across baselines); thus, those comparisons show the raw performance of Pythia as published by its authors. When starting from a random initialization, we use the standard procedure from above.
Table 18 shows a comparison between a 160M Pythia model and the 125M DEPT models when starting from pre-trained embeddings. At this scale, the additional pre-training of Pythia (using 30× the tokens of DEPT) does not provide an evident advantage as the model capacity is insufficient to benefit from it. Thus, outside of the expected outperformance on Ubuntu IRC (UI), Pythia-160M performs similarly to DEPT models and is slightly outperformed on average. We also speculate that using the full 22-dataset version of The Pile during pre-training likely reduced the performance of Pythia-160M as it had to fit a broader data distribution. We do not provide random initialization results for this model size since we found it impossible to make it behave well during continued pre-training, and we believe the comparison would be unfair.
Tables 19 and 20 show the expected outperformance of the 410M Pythia model over the DEPT models, as this size has sufficient capacity to benefit from the extensive (10× longer compared to DEPT) pre-training. When starting from a random initialization (Table 19), the best DEPT variant is within 1 average perplexity point of Pythia-410M, indicating that a large portion of the additional token budget is primarily used to obtain better embeddings without providing a significantly improved transformer body. When starting from pre-initialized embeddings (Table 20), Pythia-410M significantly outperforms DEPT, achieving an average perplexity 10 points lower than the best DEPT variant. As discussed above, this is driven by its more extensive pre-training and improved embeddings.

Section: C APPLICATIONS

Section: C.1 FEDERATED PRE-TRAINING OF LLMS ON MULTILINGUAL POPULATION
The challenges of training under data heterogeneity have come back into focus with recent forays into federated pre-training (Douillard et al., 2023;Sani et al., 2024;Charles et al., 2023;Nous Research, 2024), triggered in equal parts by privacy concerns, compute sharing and the search for more data in previously untapped reservoirs.
The way in which datasets are curated, filtered, and combined has a significant impact on the performance of LLMs (Long et al., 2024). Determining the best methods for data curation, filtering, and mixing from various sources requires extensive experimentation to identify configurations that optimize performance on target evaluation metrics (Meta, 2024). Consequently, the specific details of these processes are often closely guarded by leading LLM developers. Despite careful dataset preparation, data heterogeneity remains inevitable due to the inherent imbalance in data sources. One of the most prominent imbalances is in language representation. For instance, only about 5% of the pre-training data for Llama3 is non-English, covering over 30 languages, which results in lower expected performance in non-English contexts (Meta, 2024). A similar performance disparity across languages has also been observed with GPT-4 (Achiam et al., 2023).
Current datasets used for pre-training are highly geographically concentrated to a few areas of the globe (Faisal et al., 2022), providing the so-called high-resource languages, with high-quality domain-specific data being available predominantly in such languages (Magueresse et al., 2020). Such datasets are collected from internet sources and then curated (Brown et al., 2020;Dubey et al., 2024). However, bottlenecks in the rate of high-quality data generation (Villalobos et al., 2022) and copyright concerns (Grynbaum & Mac, 2023) have led to large organizations making deals with private data providers such as publishers (OpenAI, 2023;Patel & Palazzolo, 2024) in order to meet the demand of ever-growing models.
Federated pre-training as a methodology allows the model to be taken directly to the training data, potentially enabling training under privacy concerns or legislation that limits data movement (Woisetschläger et al., 2024). While this has obvious applications for collaborative training of LMs, it can also be applied by a single organization as a drop-in replacement for mini-batch SGD during pre-training (Douillard et al., 2023), which eliminates dataset movement while massively lowering the communication frequency of model training compared to data-parallel algorithms (Rajbhandari et al., 2020), which need to synchronize gradients every batch. The version of the algorithm used in a centralized setting, mathematically equivalent to Federated Averaging (McMahan et al., 2017a), has alternatively been known as: (a) communication-efficient SGD (Yu et al., 2019), (b) Local SGD (Stich, 2019;Ortiz et al., 2021), or (c) a specific variant of the REPTILE (Nichol et al., 2018) meta-learning algorithm. Under these various methodologies, it has been shown to (a) confer a linear speedup to convergence similar to increasing batch size, (b) provide better generalization to models compared to standard large-batch training, and (c) enable meta-learning across various tasks.
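As a minimal illustration of the Local SGD / Federated Averaging pattern described above, the following sketch runs it on a toy one-dimensional problem. The quadratic objectives, the use of full gradients in place of stochastic mini-batches, and plain averaging as the outer step are simplifying assumptions chosen for illustration.

```python
def local_sgd(targets, rounds, n_local, lr, w0=0.0):
    """Local SGD / Federated Averaging on a toy problem.

    Worker i minimizes 0.5 * (w - targets[i]) ** 2. Every round, each worker
    starts from the shared parameter, takes n_local gradient steps on its own
    objective, and the results are averaged (one communication per round).
    """
    w_global = w0
    for _ in range(rounds):
        local_results = []
        for c in targets:
            w = w_global
            for _ in range(n_local):
                w -= lr * (w - c)  # gradient of 0.5 * (w - c) ** 2
            local_results.append(w)
        w_global = sum(local_results) / len(local_results)
    return w_global


# Three workers with different optima; the shared optimum is their mean, 4.0.
w = local_sgd(targets=[1.0, 3.0, 8.0], rounds=20, n_local=5, lr=0.1)
print(w)
```

In this toy run, 100 gradient steps per worker require only 20 synchronizations, which is the property that makes the approach attractive over low-bandwidth connections.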
While the current centralized pre-training recipe may be stabilized with great effort, such measures are largely impractical in federated training scenarios where the participants refuse to offer full control over their data to a third party or where the underlying training distribution may shift as new participants enter a federation or old ones exit. Furthermore, the complexity of the current pipeline is impractical to all except the best-funded organizations, even in a centralized training context.
The inability to directly inspect data sources in a federated context makes it impossible to construct a dedicated vocabulary for a data mixture, ensure a standard curation pipeline on a per-sample basis, or strongly control data sampling rates across all sources. Motivated by this extreme setting, we aim to construct a pre-training procedure that is capable of learning from multiple highly heterogeneous data sources without model divergence while providing a foundation model with greater generalization and more plasticity in adapting to new data.

Section: D TRAINING UNDER DATA HETEROGENEITY
Training LLMs such as Llama 3 (Dubey et al., 2024) requires extensive manual tuning, heuristics, and model-based data selection procedures. This effort aims to achieve the desired mix of categories, such as general knowledge, mathematics, coding, and multilingual data.
This complexity arises due to the wide range of capabilities required by LMs and the risk of negative interference across domains and languages. Current pre-training methodologies are prone to divergence unless data sampling ratios can be meticulously curated based on the characteristics of the data and its fit to the model's distribution at any given time (Dubey et al., 2024). Multi-domain ratios are manually curated for downstream performance, requiring extensive and expensive tuning, while multilingual pre-training often employs temperature-weighted sampling (Devlin et al., 2019;Conneau et al., 2020;Xue et al., 2021) due to the vast number of languages involved. As illustrated in Fig. 2, pre-training on heterogeneous data can result in model activation divergence (Hoffmann et al., 2022), even with a sampling temperature of τ = 1.0, which corresponds to proportional sampling based on dataset size. Activation divergence is a precursor to significant, often irrecoverable, increases in loss, necessitating model restarts from earlier checkpoints with lower learning rates (Zhang et al., 2022). Longer training durations could be achieved by disproportionately sampling from larger, lower-quality datasets like C4 or high-resource languages like English in multilingual pre-training. Alternatively, methods like active forgetting via embedding resetting (Chen et al., 2023), ACT, may artificially extend the training duration past the natural divergence point.
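Temperature-weighted sampling can be sketched as follows, assuming the convention in which the probability of a data source is proportional to its size raised to the power τ. This form is consistent with τ = 1.0 being proportional sampling, and under it τ = 0 reduces to uniform sampling. The dataset sizes below are hypothetical.

```python
def sampling_probabilities(sizes, tau):
    """Sampling probability per data source, proportional to size ** tau.

    With this convention tau = 0 gives uniform sampling and tau = 1 gives
    sampling proportional to dataset size.
    """
    weights = {name: size ** tau for name, size in sizes.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical token counts, for illustration only.
sizes = {"EN": 1_000_000, "IT": 100_000, "SW": 1_000}
uniform = sampling_probabilities(sizes, tau=0.0)
tempered = sampling_probabilities(sizes, tau=0.3)
proportional = sampling_probabilities(sizes, tau=1.0)

for name in sizes:
    print(name, round(uniform[name], 3), round(tempered[name], 3), round(proportional[name], 3))
```

Intermediate temperatures upsample low-resource sources such as SW relative to their raw share while still favoring the larger sources.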
Previous studies show that this Curse of Multilinguality and/or Negative Interference can be attributed to vocabulary dilution and capacity contention (Conneau et al., 2020), language-specific parameter emergence (Wang et al., 2020), and suboptimal tokenization (Rust et al., 2021). Increasing model and vocabulary size alleviates capacity contention (Conneau et al., 2020;Wang et al., 2020), but this requires immense hardware resources (Dubey et al., 2024) to shard the model across multiple GPUs. Addressing vocabulary dilution in highly multilingual models is even more challenging, as providing enough tokens for all languages would result in impractically large models (Rust et al., 2021). These limitations drive us to find scalable methods to incorporate broader data mixtures without significantly increasing the in-memory model size during training.

Section: E FINE-TUNING DEPT MODELS
We evaluate fine-tuning performance on three downstream tasks: RACE, MNLI, and STSB. All models are fine-tuned using the recipes provided by Radford et al. (2018) for each task using the AdamW optimizer with a linear learning rate scheduler. For RACE, the model is trained for 5 epochs with a learning rate of 6e-5 and a batch size of 16. For MNLI, fine-tuning is performed over 2 epochs with a learning rate of 4e-5 and a batch size of 32. Finally, STSB is fine-tuned for 5 epochs using a learning rate of 2e-5 and a batch size of 32. The results are reported in Table 21.
Table 21: The performance on downstream tasks (↑), following continued pre-training, shows that DEPT models achieve 3%-7.5% relative improvements over the baselines, with TRIM delivering the best results. DEPT consistently outperforms baselines, even with pre-trained embedding initialization, underscoring the importance of an effective transformer body.

Section: ACKNOWLEDGMENTS
All costs for the computation used for this work were funded by Flower Labs, and the research was conducted by a team of researchers from Flower Labs and The University of Cambridge. Support for university-based researchers came from a variety of sources, but in particular, the following funding organizations are acknowledged: the European Research Council (REDIAL), the Royal Academy of Engineering (DANTE), and the Ministry of Education of Romania through the Credit and Scholarship Agency.

Section: F USING SPEC MODELS FOR INFERENCE
As discussed in Sections 2.4, 3.5 and 6.1, SPEC models do not inherently support inference on a broad corpus after initial pre-training. If local vocabularies and embedding matrices are available without privacy concerns, inference can be performed using the embedding matrix of the broadest data source or the one closest to the target application. For instance, targeting English text would utilize EN embeddings for MC4 or CC embeddings for The Pile. While effective, this limits generalization beyond the broadest dataset in the pre-training distribution.
To handle a corpus resembling a mixture of all pre-training data sources or unseen ones, SPEC models require a broader embedding matrix for good performance. This can be achieved through multi-phased adaptive or continued pre-training, starting with a random embedding matrix or the broadest pre-training one, as demonstrated in this work. Alternatives include vocabulary/embedding transfer (Remy et al., 2024) or vocabulary matching (Xu et al., 2024). If these methods fail to reach the desired performance, additional optimization may be necessary to align the embeddings with the transformer body.


References:
[b0] Josh Achiam; Steven Adler; Sandhini Agarwal; Lama Ahmad; Ilge Akkaya; Florencia Leoni Aleman; Diogo Almeida; Janko Altenschmidt; Sam Altman; Shyamal Anadkat (2023). GPT-4 technical report. 
[b1] Rohan Anil; Andrew M Dai; Orhan Firat; Melvin Johnson; Dmitry Lepikhin; Alexandre Passos; Siamak Shakeri; Emanuel Taropa; Paige Bailey; Zhifeng Chen (2023). Palm 2 technical report. 
[b2] Mikel Artetxe; Sebastian Ruder; Dani Yogatama (2020). On the cross-lingual transferability of monolingual representations. Association for Computational Linguistics
[b3] Daniel J Beutel; Taner Topal; Akhil Mathur; Xinchi Qiu; Javier Fernandez-Marques; Yan Gao; Lorenzo Sani; Kwing Hei Li; Titouan Parcollet; Pedro Porto Buarque de Gusmão; Nicholas D Lane (2022). Flower: A friendly federated learning research framework. 
[b4] Stella Biderman; Hailey Schoelkopf; Quentin Gregory Anthony; Herbie Bradley; Kyle O'Brien; Eric Hallahan; Mohammad Aflah Khan; Shivanshu Purohit; USVSN Sai Prashanth; Edward Raff; Aviya Skowron; Lintang Sutawika; Oskar van der Wal (2023). Pythia: A suite for analyzing large language models across training and scaling. PMLR
[b5] Sid Black; Stella Biderman; Eric Hallahan; Quentin Anthony; Leo Gao; Laurence Golding; Horace He; Connor Leahy; Kyle McDonell; Jason Phang; Michael Pieler; USVSN Sai Prashanth; Shivanshu Purohit; Laria Reynolds; Jonathan Tow; Ben Wang; Samuel Weinbach (2022). GPT-NeoX-20B: An open-source autoregressive language model. 
[b6] Cody Blakeney; Mansheej Paul; Brett W Larsen; Sean Owen; Jonathan Frankle (2024). Does your data spark joy? performance gains from domain upsampling at the end of training. 
[b7] Eleftheria Briakou; Colin Cherry; George Foster (2023). Searching for needles in a haystack: On the role of incidental bilingualism in palm's translation capability. 
[b8] Tom B Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell; Sandhini Agarwal; Ariel Herbert-Voss; Gretchen Krueger; Tom Henighan; Rewon Child; Aditya Ramesh; Daniel M Ziegler; Jeffrey Wu; Clemens Winter; Christopher Hesse; Mark Chen; Eric Sigler; Mateusz Litwin; Scott Gray; Benjamin Chess; Jack Clark; Christopher Berner; Sam Mccandlish; Alec Radford; Ilya Sutskever; Dario Amodei (2020). Language models are few-shot learners. 
[b9] Daniel M Cer; Mona T Diab; Eneko Agirre; Iñigo Lopez-Gazpio; Lucia Specia (2017). Semeval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. 
[b10] Tyler A Chang; Catherine Arnett; Zhuowen Tu; Benjamin K Bergen (2023). When is multilinguality a curse? language modeling for 250 high-and low-resource languages. 
[b12] Zachary Charles; Nicole Mitchell; Krishna Pillutla; Michael Reneer; Zachary Garrett (2023-12-10). Towards federated foundation models: Scalable dataset pipelines for group-structured learning. 
[b13] Yihong Chen; Kelly Marchisio; Roberta Raileanu; David Ifeoluwa Adelani; Pontus Lars Erik Saito Stenetorp; Sebastian Riedel; Mikel Artetxe (2023). Improving language plasticity via pretraining with active forgetting. 
[b14] Rochelle Choenni; Dan Garrette; Ekaterina Shutova (2023). How do languages influence each other? studying cross-lingual data sharing during llm fine-tuning. 
[b15] Aakanksha Chowdhery; Sharan Narang; Jacob Devlin; Maarten Bosma; Gaurav Mishra; Adam Roberts; Paul Barham; Hyung Won Chung; Charles Sutton; Sebastian Gehrmann; Parker Schuh; Kensen Shi; Sasha Tsvyashchenko; Joshua Maynez; Abhishek Rao; Parker Barnes; Yi Tay; Noam Shazeer; Vinodkumar Prabhakaran; Emily Reif; Nan Du; Ben Hutchinson; Reiner Pope; James Bradbury; Jacob Austin; Michael Isard; Guy Gur-Ari; Pengcheng Yin; Toju Duke; Anselm Levskaya; Sanjay Ghemawat; Sunipa Dev; Henryk Michalewski; Xavier Garcia; Vedant Misra; Kevin Robinson; Liam Fedus; Denny Zhou; Daphne Ippolito; David Luan; Hyeontaek Lim; Barret Zoph; Alexander Spiridonov; Ryan Sepassi; David Dohan; Shivani Agrawal; Mark Omernick; Andrew M Dai; Thanumalayan Sankaranarayana Pillai; Marie Pellat; Aitor Lewkowycz; Erica Moreira; Rewon Child; Oleksandr Polozov; Katherine Lee; Zongwei Zhou; Xuezhi Wang; Brennan Saeta; Mark Diaz; Orhan Firat; Michele Catasta; Jason Wei; Kathy Meier-Hellstern; Douglas Eck; Jeff Dean; Slav Petrov; Noah Fiedel (2023). Palm: Scaling language modeling with pathways. J. Mach. Learn. Res
[b16] Hyung Won Chung; Dan Garrette; Kiat Chuan Tan; Jason Riesa (2020). Improving multilingual models with language-clustered vocabularies. Association for Computational Linguistics
[b17] Alexis Conneau; Guillaume Lample (2019). Cross-lingual language model pretraining. Advances in neural information processing systems
[b18] Alexis Conneau; Kartikay Khandelwal; Naman Goyal; Vishrav Chaudhary; Guillaume Wenzek; Francisco Guzmán; Edouard Grave; Myle Ott; Luke Zettlemoyer; Veselin Stoyanov (2020). Unsupervised cross-lingual representation learning at scale. Association for Computational Linguistics
[b19]  Databricks (2024). mosaic research. 
[b20] Mostafa Dehghani; Josip Djolonga; Basil Mustafa; Piotr Padlewski; Jonathan Heek; Justin Gilmer; Andreas Peter Steiner; Mathilde Caron; Robert Geirhos; Ibrahim Alabdulmohsin; Rodolphe Jenatton; Lucas Beyer; Michael Tschannen; Anurag Arnab; Xiao Wang; Carlos Riquelme Ruiz; Matthias Minderer; Joan Puigcerver; Utku Evci; Manoj Kumar; Sjoerd Van Steenkiste; Gamaleldin Fathy Elsayed; Aravindh Mahendran; Fisher Yu; Avital Oliver; Fantine Huot; Jasmijn Bastings; Mark Collier; Alexey A Gritsenko; Vighnesh Birodkar; Cristina Nader Vasconcelos; Yi Tay; Thomas Mensink; Alexander Kolesnikov; Filip Pavetic; Dustin Tran; Thomas Kipf; Mario Lucic; Xiaohua Zhai; Daniel Keysers; Jeremiah J Harmsen; Neil Houlsby (2023). Scaling vision transformers to 22 billion parameters. PMLR
[b21] Jacob Devlin; Ming-Wei Chang; Kenton Lee; Kristina Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. Association for Computational Linguistics
[b22] Bosheng Ding; Chengwei Qin; Ruochen Zhao; Tianze Luo; Xinze Li; Guizhen Chen; Wenhan Xia; Junjie Hu; Anh Tuan Luu; Shafiq Joty (2024). Data augmentation using llms: Data perspectives, learning paradigms and challenges. 
[b23] Arthur Douillard; Qixuan Feng; Andrei A Rusu; Rachita Chhaparia; Yani Donchev; Adhiguna Kuncoro; Marc'Aurelio Ranzato; Arthur Szlam; Jiajun Shen (2023). DiLoCo: Distributed low-communication training of language models. 
[b24] Abhimanyu Dubey; Abhinav Jauhri; Abhinav Pandey; Abhishek Kadian; Ahmad Al-Dahle; Aiesha Letman; Akhil Mathur; Alan Schelten; Amy Yang; Angela Fan; Anirudh Goyal; Anthony Hartshorn; Aobo Yang; Archi Mitra; Archie Sravankumar; Artem Korenev; Arthur Hinsvark; Arun Rao; Aston Zhang; Aurélien Rodriguez; Austen Gregerson; Ava Spataru; Baptiste Rozière; Bethany Biron; Binh Tang; Bobbie Chern; Charlotte Caucheteux; Chaya Nayak; Chloe Bi; Chris Marra; Chris Mcconnell; Christian Keller; Christophe Touret; Chunyang Wu; Corinne Wong; Cristian Canton Ferrer; Cyrus Nikolaidis; Damien Allonsius; Daniel Song; Danielle Pintz; Danny Livshits; David Esiobu; Dhruv Choudhary; Dhruv Mahajan; Diego Garcia-Olano; Diego Perino; Dieuwke Hupkes; Egor Lakomkin; Ehab Albadawy; Elina Lobanova; Emily Dinan; Eric Michael Smith; Filip Radenovic; Frank Zhang; Gabriel Synnaeve; Gabrielle Lee; Georgia Lewis Anderson; Graeme Nail; Grégoire Mialon; Guan Pang; Guillem Cucurell; Hailey Nguyen; Hannah Korevaar; Hu Xu; Hugo Touvron; Iliyan Zarov; Imanol Arrieta Ibarra; Isabel M Kloumann; Ishan Misra; Ivan Evtimov; Jade Copet; Jaewon Lee; Jan Geffert; Jana Vranes; Jason Park; Jay Mahadeokar; Jeet Shah; Jelmer Van Der Linde; Jennifer Billock; Jenny Hong; Jenya Lee; Jeremy Fu; Jianfeng Chi; Jianyu Huang; Jiawen Liu; Jie Wang; Jiecao Yu; Joanna Bitton; Joe Spisak; Jongsoo Park; Joseph Rocca; Joshua Johnstun; Joshua Saxe; Junteng Jia; Kalyan Vasuden Alwala; Kartikeya Upasani; Kate Plawiak; Ke Li; Kenneth Heafield; Kevin Stone (2024). The llama 3 herd of models. 
[b25] Fahim Faisal; Yinkai Wang; Antonios Anastasopoulos (2022). Dataset geography: Mapping language data to language users. Association for Computational Linguistics
[b26] Leo Gao; Stella Biderman; Sid Black; Laurence Golding; Travis Hoppe; Charles Foster; Jason Phang; Horace He; Anish Thite; Noa Nabeshima; Shawn Presser; Connor Leahy (2021). The pile: An 800gb dataset of diverse text for language modeling. 
[b27] Michael M Grynbaum; Ryan Mac (2023-12). The Times sues OpenAI and Microsoft over A.I. use of copyrighted work. 
[b28] Suchin Gururangan; Ana Marasovic; Swabha Swayamdipta; Kyle Lo; Iz Beltagy; Doug Downey; Noah A Smith (2020). Don't stop pretraining: Adapt language models to domains and tasks. Association for Computational Linguistics
[b29] Jordan Hoffmann; Sebastian Borgeaud; Arthur Mensch; Elena Buchatskaya; Trevor Cai; Eliza Rutherford; Diego De Las Casas; Lisa Anne Hendricks; Johannes Welbl; Aidan Clark; Tom Hennigan; Eric Noland; Katie Millican; George Van Den Driessche; Bogdan Damoc; Aurelia Guy; Simon Osindero; Karen Simonyan; Erich Elsen; Jack W Rae; Oriol Vinyals; Laurent Sifre (2022). Training compute-optimal large language models. 
[b30] Kaiyu Huang; Fengran Mo; Hongliang Li; You Li; Yuanchi Zhang; Weijian Yi; Yulong Mao; Jinchen Liu; Yuzhuang Xu; Jinan Xu (2024). A survey on large language models with multilingualism: Recent advances and new frontiers. 
[b31] Hakan Inan; Khashayar Khosravi; Richard Socher (2017). Tying word vectors and word classifiers: A loss framework for language modeling. 
[b32] James Kirkpatrick; Razvan Pascanu; Neil Rabinowitz; Joel Veness; Guillaume Desjardins; Andrei A Rusu; Kieran Milan; John Quan; Tiago Ramalho; Agnieszka Grabska-Barwinska (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences
[b33] Taku Kudo; John Richardson (2018). Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. Association for Computational Linguistics
[b34] Guokun Lai; Qizhe Xie; Hanxiao Liu; Yiming Yang; Eduard H Hovy (2017). RACE: large-scale reading comprehension dataset from examinations. Association for Computational Linguistics
[b35] Viet Dac Lai; Nghia Trung Ngo; Amir Pouran Ben Veyseh; Hieu Man; Franck Dernoncourt; Trung Bui; Thien Huu Nguyen (2023). Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning. 
[b36] Mike Lewis; Yinhan Liu; Naman Goyal; Marjan Ghazvininejad; Abdelrahman Mohamed; Omer Levy; Veselin Stoyanov; Luke Zettlemoyer (2020). BART: denoising sequence-to-sequence pretraining for natural language generation, translation, and comprehension. Association for Computational Linguistics
[b37] Shen Li; Yanli Zhao; Rohan Varma; Omkar Salpekar; Pieter Noordhuis; Teng Li; Adam Paszke; Jeff Smith; Brian Vaughan; Pritam Damania; Soumith Chintala (2020-08). Pytorch distributed: Experiences on accelerating data parallel training. Proc. VLDB Endow
[b38] Zihao Li; Yucheng Shi; Zirui Liu; Fan Yang; Ninghao Liu; Mengnan Du (2024). Quantifying multilingual performance of large language models across languages. 
[b39] Tao Lin; Sebastian U Stich; Kshitij Kumar Patel; Martin Jaggi (2020). Don't use large mini-batches, use local SGD. 
[b40] Lin Long; Rui Wang; Ruixuan Xiao; Junbo Zhao; Xiao Ding; Gang Chen; Haobo Wang (2024). On llms-driven synthetic data generation, curation, and evaluation: A survey. 
[b41] Ilya Loshchilov; Frank Hutter (2019). Decoupled weight decay regularization. 
[b42] Alexandre Magueresse; Vincent Carles; Evan Heetderks (2020). Low-resource languages: A review of past work and future challenges. 
[b43] Sam Mccandlish; Jared Kaplan; Dario Amodei; OpenAI Dota Team (2018). An empirical model of large-batch training. 
[b44] Sean Mcleish; Arpit Bansal; Alex Stein; Neel Jain; John Kirchenbauer; Brian R Bartoldson; Bhavya Kailkhura; Abhinav Bhatele; Jonas Geiping; Avi Schwarzschild; Tom Goldstein (2024). Transformers can do arithmetic with the right embeddings. 
[b45] Brendan Mcmahan; Eider Moore; Daniel Ramage; Seth Hampson; Blaise Aguera Y Arcas (2017). Communication-efficient learning of deep networks from decentralized data. PMLR
[b47] Meta Ai (2024). Introducing meta llama 3: The most capable openly available llm to date. Meta AI
[b48] Luca Moschella; Valentino Maiorca; Marco Fumero; Antonio Norelli; Francesco Locatello; Emanuele Rodolà (2023). Relative representations enable zero-shot latent space communication. 
[b49] Vladislav Mosin; Igor Samenko; Borislav Kozlovskii; Alexey Tikhonov; Ivan P Yamshchikov (2023). Fine-tuning transformers: Vocabulary transfer. Artificial Intelligence
[b50] Alex Nichol; Joshua Achiam; John Schulman (2018). On first-order meta-learning algorithms. 
[b51] Jose Javier Gonzalez Ortiz; Jonathan Frankle; Mike Rabbat; Ari S Morcos; Nicolas Ballas (2021). Trade-offs of local SGD at scale: An empirical study. 
[b52] Sahil Patel; Stephanie Palazzolo (2024-01). OpenAI offers publishers as little as $1 million a year -the information. 
[b53] Telmo Pires; Eva Schlinger; Dan Garrette (2019). How multilingual is multilingual BERT? Association for Computational Linguistics
[b54] Ofir Press; Noah Smith; Mike Lewis (2022). Train short, test long: Attention with linear biases enables input length extrapolation. 
[b55] Alec Radford; Karthik Narasimhan; Tim Salimans; Ilya Sutskever (2018). Improving language understanding by generative pre-training. 
[b56] Alec Radford; Jeff Wu; Rewon Child; David Luan; Dario Amodei; Ilya Sutskever (2019). Language models are unsupervised multitask learners. 
[b57] Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J Liu (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res
[b58] Samyam Rajbhandari; Jeff Rasley; Olatunji Ruwase; Yuxiong He (2020). Zero: memory optimizations toward training trillion parameter models. IEEE/ACM
[b59] François Remy; Pieter Delobelle; Hayastan Avetisyan; Alfiya Khabibullina; Miryam De Lhoneux; Thomas Demeester (2024). Trans-tokenization and cross-lingual vocabulary transfers: Language adaptation of LLMs for low-resource NLP. 
[b60] Phillip Rust; Jonas Pfeiffer; Ivan Vulic; Sebastian Ruder; Iryna Gurevych (2021). How good is your tokenizer? on the monolingual performance of multilingual language models. Association for Computational Linguistics
[b61] Lorenzo Sani; Alex Iacob; Zeyu Cao; Bill Marino; Yan Gao; Tomas Paulik; Wanru Zhao; William F Shen; Preslav Aleksandrov; Xinchi Qiu; Nicholas D Lane (2024). The future of large language model pre-training is federated. 
[b62] Teven Le Scao; Angela Fan; Christopher Akiki; Ellie Pavlick; Suzana Ilic; Daniel Hesslow; Roman Castagné; Alexandra Sasha Luccioni; François Yvon; Matthias Gallé; Jonathan Tow; Alexander M Rush; Stella Biderman; Albert Webson; Pawan Sasanka Ammanamanchi; Thomas Wang; Benoît Sagot; Niklas Muennighoff; Albert Villanova Del Moral; Olatunji Ruwase; Rachel Bawden; Stas Bekman; Angelina Mcmillan-Major; Iz Beltagy; Huu Nguyen; Lucile Saulnier; Samson Tan; Pedro Ortiz Suarez; Victor Sanh; Hugo Laurençon; Yacine Jernite; Julien Launay; Margaret Mitchell; Colin Raffel; Aaron Gokaslan; Adi Simhi; Aitor Soroa; Alham Fikri Aji; Amit Alfassy; Anna Rogers; Ariel Kreisberg Nitzav; Canwen Xu; Chenghao Mou; Chris Emezue; Christopher Klamm; Colin Leong; Daniel Van Strien; David Ifeoluwa Adelani (2022). BLOOM: A 176B-parameter open-access multilingual language model. 
[b63] Richard Socher; Alex Perelygin; Jean Wu; Jason Chuang; Christopher D Manning; Andrew Y Ng; Christopher Potts (2013). Recursive deep models for semantic compositionality over a sentiment treebank. ACL
[b64] Sebastian U Stich (2019). Local SGD converges fast and communicates little. 
[b65] Chaofan Tao; Qian Liu; Longxu Dou; Niklas Muennighoff; Zhongwei Wan; Ping Luo; Min Lin; Ngai Wong (2024). Scaling laws with vocabulary: Larger models deserve larger vocabularies. 
[b66] Asahi Ushio; Yi Zhou; José Camacho-Collados (2023). Efficient multilingual language model compression through vocabulary trimming. Association for Computational Linguistics
[b67] Ahmet Üstün; Viraat Aryabumi; Zheng-Xin Yong; Wei-Yin Ko; Daniel D'souza; Gbemileke Onilude; Neel Bhandari; Shivalika Singh; Hui-Lee Ooi; Amr Kayid; Freddie Vargus; Phil Blunsom; Shayne Longpre; Niklas Muennighoff; Marzieh Fadaee; Julia Kreutzer; Sara Hooker (2024). Aya model: An instruction finetuned open-access multilingual language model. Association for Computational Linguistics
[b68] Pablo Villalobos; Jaime Sevilla; Lennart Heim; Tamay Besiroglu; Marius Hobbhahn; Anson Ho (2022). Will we run out of data? an analysis of the limits of scaling datasets in machine learning. 
[b69] Guan Wang; Sijie Cheng; Xianyuan Zhan; Xiangang Li; Sen Song; Yang Liu (2023). Openchat: Advancing open-source language models with mixed-quality data. 
[b70] Zirui Wang; Zachary C Lipton; Yulia Tsvetkov (2020). On negative interference in multilingual models: Findings and A meta-learning treatment. Association for Computational Linguistics
[b71] Adina Williams; Nikita Nangia; Samuel R Bowman (2018). A broad-coverage challenge corpus for sentence understanding through inference. Association for Computational Linguistics
[b72] Herbert Woisetschläger; Alexander Erben; Bill Marino; Shiqiang Wang; Nicholas D Lane; Ruben Mayer; Hans-Arno Jacobsen (2024). Federated learning priorities under the european union artificial intelligence act. 
[b73] Mitchell Wortsman; Peter J Liu; Lechao Xiao; Katie E Everett; Alexander A Alemi; Ben Adlam; John D Co-Reyes; Izzeddin Gur; Abhishek Kumar; Roman Novak; Jeffrey Pennington; Jascha Sohl-Dickstein; Kelvin Xu; Jaehoon Lee; Justin Gilmer; Simon Kornblith (2024). Small-scale proxies for large-scale transformer training instabilities. 
[b74] Yangyifan Xu; Jinliang Lu; Jiajun Zhang (2024). Bridging the gap between different vocabularies for LLM ensemble. Association for Computational Linguistics
[b75] Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel (2021). mT5: A massively multilingual pre-trained text-to-text transformer. Association for Computational Linguistics
[b76] Hao Yu; Rong Jin; Sen Yang (2019). On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. PMLR
[b77] Shaolei Zhang; Qingkai Fang; Zhuocheng Zhang; Zhengrui Ma; Yan Zhou; Langlin Huang; Mengyu Bu; Shangtong Gui; Yunji Chen; Xilin Chen (2023). Bayling: Bridging cross-lingual alignment and instruction following through interactive translation for large language models. 
[b78] Susan Zhang; Stephen Roller; Naman Goyal; Mikel Artetxe; Moya Chen; Shuohui Chen; Christopher Dewan; Mona T Diab; Xian Li; Xi Victoria Lin; Todor Mihaylov; Myle Ott; Sam Shleifer; Kurt Shuster; Daniel Simig; Punit Singh Koura; Anjali Sridhar; Tianlu Wang; Luke Zettlemoyer (2022). OPT: open pre-trained transformer language models. 
[b79] Wanru Zhao; Yihong Chen; Royson Lee; Xinchi Qiu; Yan Gao; Hongxiang Fan; Nicholas Donald Lane (2024). Breaking physical and linguistic borders: Multilingual federated prompt tuning for low-resource languages. 
[b80] Yanli Zhao; Andrew Gu; Rohan Varma; Liang Luo; Chien-Chin Huang; Min Xu; Less Wright; Hamid Shojanazeri; Myle Ott; Sam Shleifer; Alban Desmaison; Can Balioglu; Pritam Damania; Bernard Nguyen; Geeta Chauhan; Yuchen Hao; Ajit Mathews; Shen Li (2023). Pytorch FSDP: experiences on scaling fully sharded data parallel. Proc. VLDB Endow

Figures:
Figure fig_0: 1
Type: figure
Caption: Figure 1: Pipeline for DEPT variants: TRIM (top-right), GLOB (bottom-left), SPEC (bottom-right), with the STANDARD approach (top-left). The numbered pipeline steps proceed as follows: (1) text corpora are processed into a vocabulary and tokenizer (global for STANDARD, GLOB, and TRIM; global or personalized for SPEC); (2) corpora are tokenized into a pre-tokenized dataset; (3) WORKERS train the model on their pre-tokenized data; (4) partial training results are collected; (5) results are aggregated; (6) the new model is sent to WORKERS. Steps 3-6 repeat to convergence.
Data: 
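Steps 3-6 of the Figure 1 pipeline can be sketched as a simple round loop. The toy below is only an illustration under stated assumptions: the "model" is a short list of floats, local training is a hypothetical gradient step pulling the weights toward a per-worker target, and aggregation is plain averaging, which may differ from the OuterOpt actually used by DEPT.

```python
from typing import List


def local_train(body: List[float], target: List[float], n_local: int, lr: float) -> List[float]:
    """Step 3: a worker runs n_local toy updates on its own data (here, a target vector)."""
    w = list(body)
    for _ in range(n_local):
        w = [wi - lr * (wi - ti) for wi, ti in zip(w, target)]
    return w


def aggregate(results: List[List[float]]) -> List[float]:
    """Step 5: combine worker results; plain averaging is an assumption of this sketch."""
    k = len(results)
    return [sum(vals) / k for vals in zip(*results)]


def run_rounds(body: List[float], worker_targets: List[List[float]],
               rounds: int, n_local: int = 5, lr: float = 0.5) -> List[float]:
    for _ in range(rounds):
        # Steps 3-4: every worker trains locally and its result is collected.
        results = [local_train(body, t, n_local, lr) for t in worker_targets]
        # Step 5: aggregate; step 6: the new body is what workers start from next round.
        body = aggregate(results)
    return body


final_body = run_rounds([0.0, 0.0], [[1.0, 1.0], [3.0, 3.0]], rounds=20)
print(final_body)
```

With two workers whose targets are 1.0 and 3.0, the averaged body settles near 2.0, the compromise between both data sources.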

Figure fig_1: 2
Type: figure
Caption: Figure 2: Activations and model norms of STANDARD (STD) training versus DEPT (avg ± min/max) for a 350M model trained with identical local hyperparameters, prior to adjusting STD (τ = 0) and STD (τ = 1) (uniform and proportional sampling) to a lower learning rate. The OuterOpt of DEPT introduces regularization effects due to its noise-injection (Lin et al., 2020) and meta-learning (Nichol et al., 2018) characteristics, which constrain these sources (Zhang et al., 2022) of model divergence.
Data: 
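The τ values attached to the STD baselines control how data sources are sampled: τ = 0 is uniform and τ = 1 is proportional to dataset size. A minimal sketch, assuming the common temperature form p_i ∝ n_i^τ (this functional form and the toy sizes are assumptions consistent with the caption, not taken verbatim from the paper):

```python
from typing import List


def sampling_probs(sizes: List[int], tau: float) -> List[float]:
    """Temperature-based sampling over data sources: p_i is proportional to n_i ** tau."""
    weights = [n ** tau for n in sizes]
    total = sum(weights)
    return [w / total for w in weights]


# Two hypothetical data sources with 100 and 300 samples.
uniform = sampling_probs([100, 300], tau=0)       # tau = 0: every source equally likely
proportional = sampling_probs([100, 300], tau=1)  # tau = 1: proportional to size
in_between = sampling_probs([100, 300], tau=0.3)  # intermediate temperature
print(uniform, proportional, in_between)
```

Intermediate values such as τ = 0.3 interpolate between the two extremes, up-weighting small sources relative to proportional sampling.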

Figure fig_3: 3
Type: figure
Caption: Figure 3: Adaptation curves starting from a randomly initialized matrix. DEPT variants are always stable in their convergence, reaching the lowest perplexity for the full dataset and the out-of-distribution language (HI). They are also always the fastest to adapt; full results are available in Figure 5.
Data: 

Figure fig_4: 4
Type: figure
Caption: Figure 4: Convergence plot of our 1.3 billion parameter model trained in a vocabulary-agnostic federated fashion. For the initial rounds, we sample 4 data sources out of 8; after seeing most of the clients, we reduce the number to 2. We deliberately introduce EN only later in the experiment.
Data: 

Figure fig_7: 5
Type: figure
Caption: Figure 5: Adaptation curves starting from a randomly initialized matrix. DEPT is always stable in its convergence, reaching the lowest perplexity for the pre-training distribution (MC4-FULL), for the lowest-resource languages in the distribution (SW), and for the two out-of-distribution languages (HI, DE). It is also always the fastest to adapt.
Data: 

Figure fig_9: 6
Type: figure
Caption: Figure 6: Perplexity (a) and activation (b) curves for DEPT versus uniform sampling on the IID C4 dataset. Outside temporary spikes caused by OuterOpt, DEPT models perform similarly to standard pre-training in terms of training perplexity. However, as the activations show, DEPT still provides greater training stability, with the potential to extend pre-training.
Data: 

Figure tab_1: 
Type: table
Caption: I[the j-th token in V corresponds to the i-th local token in V_k] selects tokens from ϕ. After InnerOPT we create φ_k ∈ R^{|V| × d_model}, using zero-padding for tokens in V \ V_k, and use
Data: 
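The selection and zero-padding described in this caption can be illustrated with plain lists. This is a hedged sketch: `local_to_global` is a hypothetical index map (entry i holds the global index j of the i-th local token), which plays the role of the indicator matrix, and what happens to the padded matrix afterwards is not shown because the caption is truncated.

```python
from typing import List

Matrix = List[List[float]]


def select_local(phi: Matrix, local_to_global: List[int]) -> Matrix:
    """Pick the rows of the global embedding matrix phi used by a local vocabulary V_k."""
    return [list(phi[j]) for j in local_to_global]


def zero_pad_to_global(phi_k: Matrix, local_to_global: List[int], global_size: int) -> Matrix:
    """Scatter local rows back to global positions; tokens in V \\ V_k stay zero."""
    d_model = len(phi_k[0])
    padded = [[0.0] * d_model for _ in range(global_size)]
    for i, j in enumerate(local_to_global):
        padded[j] = list(phi_k[i])
    return padded


# Toy global matrix with |V| = 4 and d_model = 2 (values are illustrative).
phi = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
local_to_global = [2, 0]  # local token 0 is global token 2, local token 1 is global token 0

phi_k = select_local(phi, local_to_global)
padded = zero_pad_to_global(phi_k, local_to_global, global_size=len(phi))
print(phi_k, padded)
```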

Figure tab_2: 1
Type: table
Caption: Memory and communication costs of DEPT, where: M is the number of model parame-
Data: Method | Memory Cost | Per-step Comms Cost | Vocab Agnostic | STD

Figure tab_4: 2
Type: table
Caption: Practical memory and communication costs for DEPT, where the total number of steps is N = N_local · T, with T the total number of iterations, and |V_k| is the average vocabulary size across data sources. Standard pre-training requires a full in-memory embedding matrix for the global vocabulary while synchronizing gradients every step rather than every N_local steps. All DEPT variants yield communication savings, with GLOB as the baseline. TRIM provides additional savings proportional to the gap between global and local vocabulary sizes, while SPEC further reduces costs by never communicating embeddings. For the full comparison, see Table 9.
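Data:
| Type | #Blocks | Method | N_local | T | \|V_k\| ± σ | \|V_k\| × d_model | M_k (↓) | Per-step Comms Cost (↓) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multilingual | 12 | STD | 5 × 10^3 | 1 | 250 112 | 192M | 278M (1×) | 278M (1×) |
| Multilingual | 12 | GLOB | 500 | 10 | 250 112 | 192M | 278M (1×) | 0.56M (0.002×) |
| Multilingual | 12 | TRIM | 500 | 10 | 216 135 ± 27 160 | 166M | 252M (0.92×) | 0.5M (0.002×) |
| Multilingual | 12 | SPEC | 500 | 10 | 216 135 ± 27 160 | 166M | 252M (0.92×) | 0.17M (0.0006×) |
| Multilingual | 12 | SPEC-OPT | 500 | 10 | 50 257 ± 0 | 38.6M | 125M (0.45×) | 0.17M (0.0006×) |
| Multilingual (1B) | 24 | STD | 7 × 10^3 | 1 | 250 112 | 512.2M | 1.71B (1×) | 1.71B (1×) |
| Multilingual (1B) | 24 | SPEC-OPT | 500 | 14 | 50 257 ± 0 | 102.9M | 1.3B (0.76×) | 2.4M (0.001×) |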
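The per-step communication figures for the 12-block multilingual rows can be reproduced with simple arithmetic, as the sketch below shows. It assumes that the parameters exchanged once per round are amortized over N_local steps, that SPEC exchanges only the non-embedding parameters, and that d_model = 768 (inferred here from 192M / 250 112, not stated in the table); the function names are illustrative.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Size of a token embedding matrix, |V| x d_model."""
    return vocab_size * d_model


def per_step_comms(params_sent_per_round: float, n_local: int) -> float:
    """Amortized communication: parameters sent once every n_local steps."""
    return params_sent_per_round / n_local


D_MODEL = 768            # assumption: inferred from 192M / 250 112
GLOBAL_VOCAB = 250_112   # global multilingual vocabulary size from the table
TOTAL_PARAMS = 278e6     # M_k for STD and GLOB (12 blocks)
N_LOCAL = 500

global_embed = embedding_params(GLOBAL_VOCAB, D_MODEL)
glob_cost = per_step_comms(TOTAL_PARAMS, N_LOCAL)                 # GLOB sends everything
spec_cost = per_step_comms(TOTAL_PARAMS - global_embed, N_LOCAL)  # SPEC keeps embeddings local

print(f"embedding matrix: {global_embed / 1e6:.0f}M")
print(f"GLOB per-step comms: {glob_cost / 1e6:.2f}M")
print(f"SPEC per-step comms: {spec_cost / 1e6:.2f}M")
```

Under these assumptions the results round to 192M, 0.56M, and 0.17M, matching the corresponding table entries.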

Figure tab_5: 4
Type: table
Caption: 
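Data:
| Name | DM | EN | EP | FL | GH | CC | PA | SE | PP | WK | AX | UB | PC | NH | GU | HN | UI-OOD | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (UNIGRAM-CE) | (6.9) | (7.9) | (10) | (7.8) | (7.9) | (7.9) | (8.2) | (7.7) | (9.1) | (8.2) | (7.7) | (7.8) | (8) | (8.1) | (7.7) | (7.7) | (10) | (8.1) |
| STD (τ = 0) | 5.5 | 44.8 | 93.5 | 30.9 | 8.1 | 79.6 | 46.6 | 23.4 | 126.6 | 58.2 | 14.3 | 34.1 | 22.3 | 58.9 | 76.3 | 65.2 | 163.6 | 56 |
| STD (τ = 1) | 5 | 30.6 | 49.5 | 20.6 | 6 | 56.2 | 30.9 | 16.8 | 81.2 | 39.1 | 11 | 23.7 | 16.1 | 39.3 | 54.6 | 46.9 | 99 | 36.9 |
| ACT | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| GLOB | 4.8 | 25.7 | 38.2 | 17.3 | 5.4 | 47.7 | 25.7 | 14.7 | 68.3 | 32.7 | 9.9 | 20 | 14 | 32.2 | 46.5 | 39.8 | 94.8 | 31.6 |
| TRIM | 4.8 | 27.3 | 39.5 | 18.5 | 5.6 | 51.2 | 27.8 | 15.4 | 71.8 | 35.1 | 10.3 | 21.7 | 14.8 | 35.1 | 49.1 | 42.2 | 95.7 | 33.3 |
| SPEC | 4.8 | 26.7 | 36.8 | 18.2 | 5.5 | 50.1 | 27.1 | 15.1 | 69.1 | 34.2 | 10.1 | 21.1 | 14.5 | 34.3 | 48.5 | 41.7 | 97.6 | 32.7 |
| SPEC-OPT | 4.7 | 25.9 | 35 | 17.5 | 5.4 | 48.3 | 26.1 | 14.7 | 66.6 | 32.8 | 9.9 | 20.4 | 14.1 | 32.9 | 47.3 | 40.5 | 88.6 | 31.2 |
| Min Imp (%) | 3.7 | 10.6 | 20.2 | 10.1 | 7.4 | 8.9 | 10.3 | 8.4 | 11.5 | 10.3 | 7 | 8.6 | 8.2 | 10.6 | 9.9 | 10 | 1.4 | 9.7 |
| Max Imp (%) | 4.2 | 15.7 | 29.3 | 16.3 | 11 | 15.1 | 16.9 | 12.9 | 17.9 | 16.5 | 10.6 | 15.7 | 13.3 | 18 | 14.7 | 15.2 | 10.5 | 15.3 |

In-Distribution: ZH, UR, MS, IT, SR, LA, EN, SW, Avg (In-D). Out-of-Distribution: EL, HI, DE, Avg (OOD).
| Name | ZH | UR | MS | IT | SR | LA | EN | SW | Avg (In-D) | EL | HI | DE | Avg (OOD) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (UNIGRAM-CE) | (9.8) | (10.5) | (9.2) | (7.7) | (10.5) | (9) | (7.5) | (10) | (9.3) | (14.4) | (13.9) | (9.7) | (12.6) |
| STD (τ = 0) | 154.8 | 38.2 | 96.8 | 83.8 | 73.3 | 63 | 112.7 | 62.8 | 85.7 | 5660.8 | 4600.3 | 1339.2 | 3866.8 |
| STD (τ = 0.3) | 129.5 | 34.5 | 88 | 75.4 | 65.2 | 56.3 | 103.7 | 56.8 | 76.2 | 4219.2 | 3996 | 1076.3 | 3097.1 |
| STD (τ = 1) | 84.6 | 26.8 | 64.8 | 55.1 | 47.1 | 41.1 | 77.6 | 42.4 | 54.9 | 3340.3 | 2514.7 | 672.5 | 2175.8 |
| ACT | 96.1 | 28.8 | 71.3 | 60.4 | 52.3 | 44.9 | 85.6 | 46.3 | 60.7 | 2450.2 | 2412.5 | 715.9 | 1859.5 |
| GLOB | 67.7 | 22.4 | 53.7 | 46 | 38.6 | 33.9 | 65.4 | 35.2 | 45.4 | 2308.3 | 1676.5 | 559.5 | 1514.7 |
| TRIM | 67.7 | 22.8 | 55.2 | 47.5 | 39.7 | 35.1 | 67.2 | 36.3 | 46.4 | 2547.7 | 1911 | 567.4 | 1675.4 |
| SPEC | 69.5 | 23 | 55.4 | 47.8 | 40.3 | 34.7 | 68.1 | 36.3 | 46.9 | 2232.1 | 1578.8 | 544.7 | 1451.9 |
| Min Imp (%) | 17.8 | 14 | 14.5 | 13.4 | 14.6 | 14.6 | 12.2 | 14.3 | 14.4 | -4 | 20.8 | 15.6 | 10.8 |
| Max Imp (%) | 20 | 16.4 | 17.1 | 16.6 | 18.1 | 17.4 | 15.7 | 16.9 | 17.3 | 8.9 | 34.6 | 19 | 20.8 |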

Figure tab_7: 6
Type: table
Caption: Validation perplexity (↓) for 12-block models trained on MC4 using continued pre-training with uniform sampling from pre-trained embeddings. DEPT achieves a 6.4% improvement in average perplexity for in-distribution data but slightly underperforms for OOD data, winning 50% (4/8) of in-distribution and 33% (1/3) of OOD comparisons. In Table 4, DEPT wins the remaining cases due to a better transformer body.
Data: 3 -6.4-13.1-6

Figure tab_8: 7
Type: table
Caption: 
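Data:
Random Init
| Name | RACE (ACC) | MNLI (ACC) | STSB (PC) | SST2 (ACC) |
| --- | --- | --- | --- | --- |
| STD (τ = 0) | 0.50 | 0.60 | 0.66 | 0.79 |
| STD (τ = 1) | 0.46 | 0.68 | 0.73 | 0.81 |
| ACT | 0.45 | 0.66 | 0.73 | 0.80 |
| GLOB | 0.51 | 0.72 | 0.78 | 0.83 |
| TRIM | 0.53 | 0.71 | 0.78 | 0.83 |
| SPEC | 0.52 | 0.71 | 0.79 | 0.81 |
| SPEC-OPT | 0.51 | 0.69 | 0.77 | 0.85 |
| Min Imp (%) | 2.9% | 4.6% | 5.9% | -0.7% |
| Max Imp (%) | 5.8% | 6.1% | 7.5% | 4.1% |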

Figure tab_10: 9
Type: table
Caption: Practical memory and communication costs for DEPT, where the total number of steps is N = N_local · T, with T the total number of iterations, and |V_k| is the average vocabulary size across data sources. Standard pre-training requires a full in-memory embedding matrix for the global vocabulary while synchronizing gradients every step rather than every N_local steps. All DEPT variants yield communication savings, with GLOB as the baseline. TRIM provides additional savings proportional to the gap between global and local vocabulary sizes, while SPEC further reduces costs, with or without optimized vocabularies, by never communicating the token or positional matrices.
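Data:
| Type | #Blocks | Method | N_local | T | \|V_k\| ± σ | \|V_k\| × d_model | M_k (↓) | Per-step Comms Cost (↓) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multilingual | 12 | STD | 5 × 10^3 | 1 | 250 112 | 192M | 278M (1×) | 278M (1×) |
| Multilingual | 12 | GLOB | 500 | 10 | 250 112 | 192M | 278M (1×) | 0.56M (0.002×) |
| Multilingual | 12 | TRIM | 500 | 10 | 216 135 ± 27 160 | 166M | 252M (0.92×) | 0.5M (0.002×) |
| Multilingual | 12 | SPEC | 500 | 10 | 216 135 ± 27 160 | 166M | 252M (0.92×) | 0.17M (0.0006×) |
| Multilingual | 12 | SPEC-OPT | 500 | 10 | 50 257 ± 0 | 38.6M | 125M (0.45×) | 0.17M (0.0006×) |
| Multilingual-B | 24 | STD | 7 × 10^3 | 1 | 250 112 | 512.2M | 1.71B (1×) | 1.71B (1×) |
| Multilingual-B | 24 | SPEC-OPT | 500 | 14 | 50 257 ± 0 | 102.9M | 1.3B (0.76×) | 2.4M (0.001×) |
| Multi-domain | 12 | STD | 5 × 10^3 | 1 | 50 257 | 38.6M | 125M (1×) | 125M (1×) |
| Multi-domain | 12 | GLOB | 500 | 10 | 50 257 | 38.6M | 125M (1×) | 0.25M (0.002×) |
| Multi-domain | 12 | TRIM | 500 | 10 | 45 554 ± 9 462 | 35M | 121M (0.97×) | 0.24M (0.002×) |
| Multi-domain | 12 | SPEC | 500 | 10 | 45 554 ± 9 462 | 35M | 121M (0.97×) | 0.17M (0.001×) |
| Multi-domain | 24 | STD | 13.5 × 10^3 | 1 | 50 257 | 51.4M | 350M (1×) | 350M (1×) |
| Multi-domain | 24 | GLOB | 500 | 27 | 50 257 | 51.4M | 350M (1×) | 0.7M (0.002×) |
| Multi-domain | 24 | TRIM | 500 | 27 | 45 554 ± 9 462 | 46.6M | 345.2M (0.97×) | 0.69M (0.002×) |
| Multi-domain | 24 | SPEC | 500 | 27 | 45 554 ± 9 462 | 46.6M | 345.2M (0.97×) | 0.6M (0.002×) |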

Figure tab_11: 10
Type: table
Caption: Validation perplexity (↓) for our 24-block models trained on The Pile when using continued pre-training with uniform sampling starting from randomly-initialized embeddings. DEPT provides a better transformer body for all datasets, outperforming baselines by 17.5% on average.
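Data:
| Name | DM | EN | EP | FL | GH | CC | PA | EN | PP | WK | AX | UB | PC | NH | GU | HN | UI-OOD | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (UNIGRAM-CE) | (6.9) | (7.9) | (10) | (7.8) | (7.9) | (7.9) | (8.2) | (7.7) | (9.1) | (8.2) | (7.7) | (7.8) | (8) | (8.1) | (7.7) | (7.7) | (10) | (8.1) |
| STD (τ = 0) | 4.7 | 17.5 | 19.7 | 22.3 | 7 | 64.2 | 30.7 | 18.3 | 48.5 | 41.6 | 13 | 23.9 | 19 | 32.1 | 50.5 | 42 | 72.9 | 31.1 |
| STD (τ = 1) | 5.1 | 24.1 | 27.1 | 31.4 | 9.2 | 86.6 | 43.4 | 24.6 | 64.4 | 58.4 | 16.5 | 32.7 | 25.1 | 44.7 | 66.7 | 55.8 | 141 | 44.5 |
| ACT | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| GLOB | 4.5 | 14.2 | 16.3 | 18 | 6.1 | 53.6 | 24.9 | 15.5 | 40.2 | 34 | 11.2 | 19.7 | 16 | 26.2 | 41.9 | 34.6 | 58.8 | 25.6 |
| TRIM | 4.5 | 14.8 | 16.7 | 19.1 | 6.4 | 56.6 | 26.4 | 16.3 | 42.1 | 36 | 11.7 | 21 | 16.9 | 27.7 | 43.7 | 36.2 | 66.1 | 27.2 |
| SPEC | 4.5 | 14.5 | 16.2 | 18.8 | 6.2 | 55.5 | 25.8 | 16 | 41.1 | 35.1 | 11.5 | 20.5 | 16.5 | 27.2 | 43.1 | 35.7 | 63.5 | 26.6 |
| SPEC-OPT | 4.6 | 15.2 | 16.9 | 19.4 | 6.4 | 57.1 | 26.5 | 16.4 | 42.5 | 35.9 | 11.9 | 21 | 16.5 | 27.8 | 44.9 | 37.1 | 60.4 | 27.1 |
| Min Imp (%) | 2.5 | 13.5 | 14.3 | 12.9 | 8.4 | 11.2 | 13.9 | 10.4 | 12.5 | 13.7 | 8.2 | 12.3 | 11.5 | 13.4 | 11 | 11.8 | 9.2 | 12.5 |
| Max Imp (%) | 4.8 | 19.1 | 18 | 19.3 | 13 | 16.5 | 19 | 15.3 | 17.1 | 18.3 | 13.5 | 17.6 | 16 | 18.6 | 17.1 | 17.6 | 19.4 | 17.5 |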

Figure tab_13: 12
Type: table
Caption: Validation perplexity (↓) for our 12-block models trained on The Pile when performing continued pre-training starting from a randomly-initialized embedding matrix. DEPT can train a superior transformer body, outperforming all baselines across all subsets by up to 28%.
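Data:
| Name | NH | GH | PA | UB | FL | EE | EP | WK | CC | SE | PC | PP | DM | AX | GU | HN | UI-OOD | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (UNIGRAM-CE) | (8.1) | (7.9) | (8.2) | (7.8) | (7.8) | (7.9) | (10) | (8.2) | (7.9) | (7.7) | (8) | (9.1) | (6.9) | (7.7) | (7.7) | (7.7) | (10) | (8.1) |
| STD (τ = 0) | 63.0 | 12.1 | 61.9 | 44.1 | 45.2 | 36.3 | 47.9 | 81.8 | 115.9 | 33.2 | 32.4 | 91.8 | 5.8 | 21.6 | 91.2 | 75.0 | 198.1 | 62.2 |
| STD (τ = 1) | 58.6 | 11.4 | 57.2 | 41.3 | 42.2 | 33.6 | 43.1 | 75.2 | 108.2 | 31.0 | 30.3 | 85.4 | 5.7 | 20.4 | 85.6 | 70.9 | 168.0 | 57.0 |
| ACT | 126.7 | 20 | 124.8 | 79.9 | 82.1 | 66.3 | 124.2 | 147.7 | 191.6 | 55.4 | 61.2 | 180.1 | 7.4 | 33.8 | 150.8 | 123.9 | 377.8 | 114.9 |
| GLOB | 44.5 | 9.3 | 43.0 | 32.0 | 32.1 | 25.9 | 31.4 | 58.5 | 83.9 | 23.8 | 23.1 | 64.6 | 5.1 | 16.2 | 66.4 | 54.7 | 114.6 | 42.9 |
| TRIM | 43.3 | 8.8 | 41.8 | 31.2 | 30.7 | 24.6 | 29.4 | 56.2 | 82.3 | 23.4 | 22.6 | 62.7 | 5.1 | 16.0 | 64.1 | 53.2 | 99.0 | 40.8 |
| SPEC | 42.1 | 8.7 | 40.6 | 30.3 | 29.8 | 23.8 | 28.0 | 54.8 | 87.0 | 24.9 | 23.8 | 67.5 | 5.2 | 16.8 | 69.1 | 57.1 | 124.2 | 43.2 |
| Min Imp (%) | 24 | 19 | 25 | 23 | 24 | 23 | 27 | 22 | 20 | 20 | 21 | 21 | 8 | 18 | 19 | 20 | 26 | 24 |
| Max Imp (%) | 28 | 24 | 29 | 27 | 29 | 29 | 35 | 27 | 24 | 25 | 25 | 27 | 11 | 22 | 25 | 25 | 41 | 28 |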
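The Min Imp and Max Imp rows can be read as relative perplexity reductions. The sketch below assumes, since the table itself does not spell it out, that improvement is measured against the best (lowest-perplexity) baseline, with the minimum taken from the weakest DEPT variant and the maximum from the strongest. It uses the AVG column values listed for this table.

```python
def improvement_pct(best_baseline_ppl: float, dept_ppl: float) -> float:
    """Relative perplexity reduction (in %) of a DEPT variant over the best baseline."""
    return 100.0 * (best_baseline_ppl - dept_ppl) / best_baseline_ppl


# AVG column of the table (validation perplexity, lower is better).
baseline_avg = {"STD tau=0": 62.2, "STD tau=1": 57.0, "ACT": 114.9}
dept_avg = {"GLOB": 42.9, "TRIM": 40.8, "SPEC": 43.2}

best_baseline = min(baseline_avg.values())
min_imp = improvement_pct(best_baseline, max(dept_avg.values()))  # weakest DEPT variant
max_imp = improvement_pct(best_baseline, min(dept_avg.values()))  # strongest DEPT variant

print(f"Min Imp: {min_imp:.1f}%  Max Imp: {max_imp:.1f}%")
```

Under this assumed definition the AVG column gives roughly 24% and 28%, in line with the reported rows; per-dataset entries may differ slightly because the table shows rounded perplexities.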

Figure tab_15: 15
Type: table
Caption: Validation perplexity (↓) for 12-block models trained on MC4 after continued pre-training with uniform sampling from randomly-initialized embeddings, compared to models which had been pre-trained on a single data source for the same total number of tokens as DEPT has seen from their distributions. Baselines whose pre-training dataset matches the evaluation dataset are highlighted in olive.
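Data:
| Name | DM | EE | EP | FL | GH | CC | PA | SE | PP | WK | AX | UB | PC | NH | GU | HN | UI-OOD | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (UNIGRAM-CE) | (6.9) | (7.9) | (10) | (7.8) | (7.9) | (7.9) | (8.2) | (7.7) | (9.1) | (8.2) | (7.7) | (7.8) | (8) | (8.1) | (7.7) | (7.7) | (10) | (8.1) |
| CC | 4.8 | 29.3 | 44.4 | 20 | 5.9 | 54.4 | 30 | 16.4 | 77.3 | 37.7 | 10.8 | 23.2 | 15.8 | 38 | 52 | 44.7 | 109.2 | 36.1 |
| PC | 4.8 | 28.1 | 40.3 | 19.1 | 5.7 | 52.1 | 27.9 | 15.7 | 72.8 | 35.8 | 10.4 | 21.9 | 14.8 | 35.5 | 50.3 | 43.4 | 110.6 | 34.7 |
| AX | 4.9 | 28.9 | 41.7 | 19.8 | 5.7 | 53.5 | 29.1 | 15.9 | 74.4 | 36.8 | 10.5 | 22.5 | 15.3 | 36.8 | 52.1 | 44.8 | 97.4 | 34.7 |
| GH | 4.9 | 30.1 | 43.6 | 20.8 | 5.8 | 55.9 | 30.7 | 16.5 | 78.2 | 38.5 | 10.9 | 23.8 | 16 | 38.9 | 54.5 | 46.7 | 120.7 | 37.4 |
| FL | 4.9 | 31.8 | 49.8 | 21.5 | 6.3 | 58.5 | 32.4 | 17.4 | 84.2 | 40.9 | 11.4 | 24.8 | 16.8 | 41 | 55.8 | 47.9 | 122.1 | 39.3 |
| SE | 4.8 | 28.2 | 42.2 | 19.4 | 5.6 | 52.9 | 29 | 15.5 | 75 | 36.6 | 10.5 | 22.4 | 15.3 | 36.8 | 50.8 | 43.4 | 100 | 34.6 |
| WK | 4.8 | 28.1 | 42.1 | 18.9 | 5.7 | 51.6 | 28.4 | 15.8 | 74 | 34.8 | 10.5 | 21.9 | 15.1 | 35.7 | 49.8 | 43.2 | 95.4 | 33.9 |
| DM | 7.3 | 140.4 | 559.4 | 100.6 | 28 | 239.5 | 184 | 71 | 543.9 | 213.9 | 34.8 | 121.7 | 69.3 | 213.9 | 193 | 170 | 966.8 | 226.9 |
| GLOB | 4.8 | 25.7 | 38.2 | 17.3 | 5.4 | 47.7 | 25.7 | 14.7 | 68.3 | 32.7 | 9.9 | 20 | 14 | 32.2 | 46.5 | 39.8 | 94.8 | 31.6 |
| TRIM | 4.8 | 27.3 | 39.5 | 18.5 | 5.6 | 51.2 | 27.8 | 15.4 | 71.8 | 35.1 | 10.3 | 21.7 | 14.8 | 35.1 | 49.1 | 42.2 | 95.7 | 33.3 |
| SPEC | 4.8 | 26.7 | 36.8 | 18.2 | 5.5 | 50.1 | 27.1 | 15.1 | 69.1 | 34.2 | 10.1 | 21.1 | 14.5 | 34.3 | 48.5 | 41.7 | 97.6 | 32.7 |
| SPEC-OPT | 4.7 | 25.9 | 35 | 17.5 | 5.4 | 48.3 | 26.1 | 14.7 | 66.6 | 32.8 | 9.9 | 20.4 | 14.1 | 32.9 | 47.3 | 40.5 | 88.6 | 31.2 |
| Min Imp (%) | 0.7 | 2.7 | 2 | 1.6 | 0.5 | 0.8 | 0.5 | 0.3 | 1.4 | -0.9 | 1.2 | 1 | 0 | 1 | 1.3 | 2.1 | -2.3 | 1.7 |
| Max Imp (%) | 1.2 | 8.3 | 13.2 | 8.4 | 4.3 | 7.6 | 7.9 | 5.3 | 8.5 | 6 | 5 | 8.8 | 5.6 | 9.2 | 6.5 | 7.8 | 7.1 | 7.8 |

In-Distribution: ZH, UR, MS, IT, SR, LA, EN, SW, Avg (In-D). Out-of-Distribution: EL, HI, DE, Avg (OOD).
| Name | ZH | UR | MS | IT | SR | LA | EN | SW | Avg (In-D) | EL | HI | DE | Avg (OOD) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (UNIGRAM-CE) | (9.8) | (10.5) | (9.2) | (7.7) | (10.5) | (9) | (7.5) | (10) | (9.3) | (14.4) | (13.9) | (9.7) | (12.6) |
| ZH | 187.8 | 44.6 | 113.9 | 98.6 | 89.1 | 73.9 | 128.8 | 76 | 101.6 | 5744.8 | 6476.5 | 1448 | 4556.4 |
| UR | 94.8 | 27.5 | 66 | 56.9 | 48.7 | 42.5 | 79.3 | 44.1 | 57.5 | 2596.5 | 2371.1 | 690.5 | 1886 |
| MS | 78.8 | 24.8 | 58 | 50 | 42.7 | 37.3 | 70 | 38.9 | 50.1 | 2673.3 | 2329.2 | 599.4 | 1867.3 |
| IT | 81.3 | 25.1 | 59.7 | 51 | 43.8 | 38.2 | 71.8 | 39.8 | 51.3 | 2617.2 | 2256.9 | 615.3 | 1829.8 |
| SR | 85.4 | 25.7 | 61 | 52.4 | 44.9 | 39.4 | 73.7 | 40.9 | 52.9 | 2992.4 | 2648.3 | 657.2 | 2099.3 |
| LA | 104.8 | 29.6 | 71.7 | 60.6 | 53.5 | 46 | 85.2 | 47.6 | 62.4 | 2838.7 | 2824.8 | 746.6 | 2136.7 |
| EN | 104.4 | 30.1 | 71.9 | 61.2 | 54.3 | 46.2 | 85 | 47.8 | 62.6 | 3344.4 | 3360.6 | 834.8 | 2513.3 |
| SW | 79.3 | 24.7 | 58.1 | 49.9 | 43 | 37.3 | 69.9 | 39 | 50.1 | 2552.3 | 2067.5 | 608.3 | 1742.7 |
| GLOB | 67.7 | 22.4 | 53.7 | 46 | 38.6 | 33.9 | 65.4 | 35.2 | 45.4 | 2308.3 | 1676.5 | 559.5 | 1514.7 |
| TRIM | 67.7 | 22.8 | 55.2 | 47.5 | 39.7 | 35.1 | 67.2 | 36.3 | 46.4 | 2547.7 | 1911 | 567.4 | 1675.4 |
| SPEC | 69.5 | 23 | 55.4 | 47.8 | 40.3 | 34.7 | 68.1 | 36.3 | 46.9 | 2232.1 | 1578.8 | 544.7 | 1451.9 |
| Min Imp (%) | 11.8 | 6.6 | 4.4 | 4.2 | 5.7 | 6 | 2.7 | 6.8 | 6.4 | 0.2 | 7.6 | 5.3 | 3.9 |
| Max Imp (%) | 14.1 | 9.3 | 7.4 | 7.7 | 9.6 | 9.1 | 6.5 | 9.6 | 9.4 | 12.5 | 23.6 | 9.1 | 16.7 |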

Figure tab_16: 17
Type: table
Caption: Validation perplexity (↓) for 12-block models trained on MC4 after continued pretraining with uniform sampling from pre-trained embeddings, compared to models which had been pre-trained on a single data source for the same total number of tokens as DEPT has seen from their distributions. DEPT significantly outperforms in terms of average perplexity but gets beaten by specialized models on their respective data source. Baselines whose pre-training dataset matches the evaluation dataset are highlighted in olive.
Data: NameDMEEEPFLGHCCPASEPPWKAXUBPCNHGUHNUI-OODAVG(UNIGRAM-CE)(6.9)(7.9)(10)(7.8)(7.9)(7.9)(8.2)(7.7)(9.1)(8.2)(7.7)(7.8)(8)(8.1)(7.7)(7.7)(10)(8.1)CC19.6 57.6 294.435.132.230.841.147.9133.135.844.2264743.752.838.279.262.3PC4.825.537.517.35.545.517.714.86330.79.417.610.123.246.439100.829.9AX4.82738.518.75.549.725.914.764.833.87.419.513.832.949.541.4110.332.8GH531.147.324.43.962.637.113.380.644.511.626.417.947.660.847.170.137.1FL4.923.849.610.15.94527.315.873.431.610.620.414.733.642.538.499.532.2SE4.724.338.317.94.445.727.19.362.1339.519.814.334.346.134.770.229.2WK4.923.833.816.45.74025.115.257.418.610.219.413.931.339.537.498.728.9DM4.481.6 210.959.314.4143.899.741.2248.4 116.522.667.940.8124.1 126.1 108.1424.8113.8GLOB4.51716.113.24.534.517.911.237.822.48.414.41120.635.5 28.361.221.1TRIM4.620.52313.94.63820.21249.925.18.716.611.825.73832.956.823.7Min Imp (%) -2.9 13.932-37.4 -19.8 -23.5 -14.1 -28.2 13.1 -34.8 -18.7 5.5 -16.9 -10.93.85.312.718.1Max Imp (%)-128.5 52.3 -31.1 -16.6-12-1.1 -19.8 34.2 -20.4 -14.3 18.1 -9.311.210.1 18.51927In-DistributionOut-of-DistributionNameZHURMSITSRLAENSWAvg (In-D)ELHIDEAvg (OOD)(UNIGRAM-CE)(9.8)(10.5)(9.2)(7.7)(10.5)(9)(7.5)(10)(9.3)(14.4)(13.9)(9.7)(12.6)ZH33.336.687.171.368.755.990.258.162.65351.5127 3197.97314 936.293457 3161.92643UR124.712.770.862.957.449.576.849.863.13189.1106 774.645325 802.290588 1588.68217MS89.52627.848.147.138.258.938.346.72983.83643 2491.64209 620.154907 2031.87781IT91.427.459.624.446.935.859.640.748.22353.07642 3057.45215 344.762726 1918.43043SR99.328.165.152.520.241.770.143.652.62174.83203 3398.59985 643.610352 2072.34741LA100.230.266.948.15120.168.145.153.7716.628967 2479.87817 240.702454 1145.73653EN1142.1 150.7 276.5 211.1408.8147.1 86.2 169.1323.91315877.75 679658.688 11445.9727 668994.137SW91.626.255.948.34738.158.317.847.92782.6792 2673.07813 557.471863 2004.40973GLOB40.115.530.139.63929.740.524.632.41737.3823.4335.1965.3TRIM41.916.231.341.340.830.84225.633.71725855.2345.6975.3Min Imp (%) -26.1 
-28.1 -12.8 -68.9 -101.9-5328-43.927.8-142.4-10.4-43.614.9Max Imp (%) -20.6 -22.8 -8.5 -62.1 -92.7 -47.4 30.6 -38.630.7-142.4-10.4-43.614.9

B.9 COMPARISON AGAINST PYTHIA

Figure tab_17: 18
Type: table
Caption: Validation perplexity (↓) for our 12-block models trained on The Pile after continued pre-training with uniform sampling, starting from a pre-trained embedding matrix. DEPT slightly outperforms Pythia-160M at this small scale, since Pythia's 30× larger token budget brings little benefit when model capacity is insufficient. Pythia-160M was trained on Ubuntu IRC (UI), so its better perplexity there is expected: UI is not an OOD dataset for this model.
Data:
Name | NH | GH | PA | UB | FL | EE | EP | WK | CC | SE | PC | PP | DM | AX | GU | HN | UI-OOD | AVG
(UNIGRAM-CE) | (8.1) | (7.9) | (8.2) | (7.8) | (7.8) | (7.9) | (10) | (8.2) | (7.9) | (7.7) | (8) | (9.1) | (6.9) | (7.7) | (7.7) | (7.7) | (10) | (8.1)
PYTHIA-160M | 47.38.236.8 31.4 24.13432.8 40.2 64.1 22.4 21.35 74.56.816.35554.924.3133.1
GLOB | 30.2 | 7.1 | 29.8 | 22.9 | 23.7 | 20.0 | 21.9 | 41.6 | 61.8 | 17.9 | 19.6 | 48.0 | 5.2 | 13.7 | 54.1 | 42.2 | 90.7 | 32.4
TRIM | 29.5 | 6.9 | 29.2 | 22.4 | 23.0 | 19.4 | 21.1 | 40.8 | 60.6 | 17.4 | 19.2 | 46.7 | 5.1 | 13.4 | 52.4 | 41.0 | 81.1 | 31.1

Figure tab_18: 19
Type: table
Caption: Validation perplexity (↓) for 24-block models trained on The Pile after continued pre-training with proportional sampling from randomly-initialized embeddings, compared to Pythia-410M. DEPT models come close to Pythia-410M even though the latter was trained on 10× more tokens, indicating a comparable, if slightly worse, transformer body. Pythia-410M was trained on Ubuntu IRC (UI), so its better perplexity there is expected: UI is not an OOD dataset for this model.
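Data:
Name | DM | EE | EP | FL | GH | CC | PA | SE | PP | WK | AX | UB | PC | NH | GU | HN | UI-OOD | AVG
(UNIGRAM-CE) | (6.9) | (7.9) | (10) | (7.8) | (7.9) | (7.9) | (8.2) | (7.7) | (9.1) | (8.2) | (7.7) | (7.8) | (8) | (8.1) | (7.7) | (7.7) | (10) | (8.1)
PYTHIA-410M | 4.9 | 25.9 | 43.3 | 17.4 | 5.1 | 45.6 | 24.7 | 13.9 | 65.7 | 31.7 | 9.6 | 18.8 | 13.5 | 31.3 | 44.5 | 38.3 | 81.2 | 30.3
GLOB | 4.8 | 25.7 | 38.2 | 17.3 | 5.4 | 47.7 | 25.7 | 14.7 | 68.3 | 32.7 | 9.9 | 20 | 14 | 32.2 | 46.5 | 39.8 | 94.8 | 31.6
TRIM | 4.8 | 27.3 | 39.5 | 18.5 | 5.6 | 51.2 | 27.8 | 15.4 | 71.8 | 35.1 | 10.3 | 21.7 | 14.8 | 35.1 | 49.1 | 42.2 | 95.7 | 33.3
SPEC | 4.8 | 26.7 | 36.8 | 18.2 | 5.5 | 50.1 | 27.1 | 15.1 | 69.1 | 34.2 | 10.1 | 21.1 | 14.5 | 34.3 | 48.5 | 41.7 | 97.6 | 32.7
SPEC-OPT | 4.7 | 25.9 | 35 | 17.5 | 5.4 | 48.3 | 26.1 | 14.7 | 66.6 | 32.8 | 9.9 | 20.4 | 14.1 | 32.9 | 47.3 | 40.5 | 88.6 | 31.2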
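The AVG column in this table appears to be the plain mean over all 17 evaluation sets, including the out-of-distribution UI column. The following minimal Python sketch checks that reading against the PYTHIA-410M and GLOB rows; the variable and function names are illustrative and not taken from the paper.

```python
# Per-dataset validation perplexities read from the table, in column order
# DM, EE, EP, FL, GH, CC, PA, SE, PP, WK, AX, UB, PC, NH, GU, HN, UI-OOD.
rows = {
    "PYTHIA-410M": [4.9, 25.9, 43.3, 17.4, 5.1, 45.6, 24.7, 13.9, 65.7,
                    31.7, 9.6, 18.8, 13.5, 31.3, 44.5, 38.3, 81.2],
    "GLOB": [4.8, 25.7, 38.2, 17.3, 5.4, 47.7, 25.7, 14.7, 68.3,
             32.7, 9.9, 20, 14, 32.2, 46.5, 39.8, 94.8],
}
# AVG values as reported in the last column of the same rows.
reported_avg = {"PYTHIA-410M": 30.3, "GLOB": 31.6}


def mean_perplexity(values):
    """Arithmetic mean of the per-dataset perplexities."""
    return sum(values) / len(values)


for name, values in rows.items():
    # Print the recomputed mean next to the reported AVG for comparison.
    print(name, round(mean_perplexity(values), 1), reported_avg[name])
```

Because the mean includes UI, a model that saw UI during training gains an advantage on AVG that does not reflect in-distribution quality alone.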

Figure tab_19: 20
Type: table
Caption: Validation perplexity (↓) for 24-block models trained on The Pile after continued pre-training with proportional sampling from randomly-initialized embeddings, compared to Pythia-410M. Pythia-410M significantly outperforms DEPT, as its 30× larger number of training tokens allows it to train much better embeddings. Pythia-410M was trained on Ubuntu IRC (UI), so its better perplexity there is expected: UI is not an OOD dataset for this model.
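Data:
Name | DM | EE | EP | FL | GH | CC | PA | SE | PP | WK | AX | UB | PC | NH | GU | HN | UI-OOD | AVG
(UNIGRAM-CE) | (6.9) | (7.9) | (10) | (7.8) | (7.9) | (7.9) | (8.2) | (7.7) | (9.1) | (8.2) | (7.7) | (7.8) | (8) | (8.1) | (7.7) | (7.7) | (10) | (8.1)
PYTHIA-410M | 3.8 | 9.7 | 8.9 | 7.7 | 3 | 19 | 11.8 | 7.2 | 21.3 | 12.5 | 5.9 | 10.3 | 7.6 | 15 | 17.6 | 16.9 | 7.8 | 10.9
GLOB | 4.5 | 17 | 16.1 | 13.2 | 4.5 | 34.5 | 17.9 | 11.2 | 37.8 | 22.4 | 8.4 | 14.4 | 11 | 20.6 | 35.5 | 28.3 | 61.2 | 21.1
TRIM | 4.6 | 20.5 | 23 | 13.9 | 4.6 | 38 | 20.2 | 12 | 49.9 | 25.1 | 8.7 | 16.6 | 11.8 | 25.7 | 38 | 32.9 | 56.8 | 23.7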


Formulas:
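Formula formula_0: $\theta^{k}_{t},\ \phi_{t}|_{V_k},\ \psi^{k}_{t} \leftarrow \text{InnerOPT}(\theta_{t-1},\ \phi_{t-1}|_{V_k},\ \psi_{t-1},\ D_k)$ ▷ TRIM
7: $\theta^{k}_{t},\ \phi^{k}_{t},\ \psi^{k}_{t} \leftarrow \text{InnerOPT}(\theta_{t-1},\ \phi^{k}_{t-1},\ \psi^{k}_{t-}$ …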

Formula formula_1: $I_k^{\top} \in \mathbb{R}^{|V| \times |V_k|}$ to project $\phi^{k}$ back, $\varphi_k = I_k^{\top} \phi^{k}$.

Formula formula_2: $\{V_k\}_{k=1}^{K}$ such that $V = \bigcup_{k=1}^{K} V_k$.
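To make the projection concrete, the sketch below builds a global vocabulary as the union of per-source vocabularies and uses a 0/1 selection matrix to map one source's local embedding matrix back into the global index space. It is a toy illustration under stated assumptions: the vocabularies, the embedding values, the sorted ordering of the global vocabulary, and the helper names `build_projection` and `matmul` are hypothetical, and rows for tokens absent from the local vocabulary are simply left at zero.

```python
def build_projection(global_vocab, local_vocab):
    """Return a |V| x |V_k| 0/1 matrix playing the role of I_k^T.

    Row i, column j is 1 when global token i is local token j.
    """
    col = {tok: j for j, tok in enumerate(local_vocab)}
    return [
        [1.0 if col.get(tok) == j else 0.0 for j in range(len(local_vocab))]
        for tok in global_vocab
    ]


def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x d)."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical per-source vocabularies V_1 and V_2.
local_vocabs = [["the", "cat"], ["the", "chat", "le"]]
# V is the union of all V_k; sorting only fixes an arbitrary global order.
global_vocab = sorted(set().union(*local_vocabs))

# Hypothetical local embeddings for V_1 with d_model = 2 (one row per token).
phi_k = [[1.0, 2.0], [3.0, 4.0]]

proj = build_projection(global_vocab, local_vocabs[0])
phi_global = matmul(proj, phi_k)  # |V| x d_model, zero rows for unseen tokens
```

In practice the same effect is usually obtained with index scatter operations rather than an explicit dense matrix, since the selection matrix is extremely sparse.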

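Formula formula_3:
 | $O(M)$ | $O(M)$ | ×
GLOB | $O(M)$ | $O\left(\frac{M}{N_{\text{local}}}\right)$ | ×
TRIM | $O(M - (|V| - |V_k|)\,d_{\text{model}})$ | $O\left(\frac{M - (|V| - |V_k|)\,d_{\text{model}}}{N_{\text{local}}}\right)$ | ×
SPEC | $O(M - (|V| - |V_k|)\,d_{\text{model}})$ | $O\left(\frac{M - (|V| + L)\,d_{\text{model}}}{N_{\text{local}}}\right)$ | ✓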
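As a rough illustration of how these expressions compare, the following sketch evaluates the quantities inside the O(·) terms for the GLOB, TRIM, and SPEC rows. All numeric values are hypothetical, the function name `costs` is an assumption, and `L` is used exactly as it appears in the SPEC expression without assuming what it denotes. Constant factors hidden by the asymptotic notation are ignored.

```python
def costs(M, V, V_k, L, d_model, n_local):
    """Evaluate the two cost expressions per variant as raw parameter counts.

    Returns {variant: (first_expression, second_expression)}.
    """
    # Parameters left after dropping embeddings for tokens outside V_k.
    trimmed = M - (V - V_k) * d_model
    return {
        "GLOB": (M, M / n_local),
        "TRIM": (trimmed, trimmed / n_local),
        "SPEC": (trimmed, (M - (V + L) * d_model) / n_local),
    }


# Hypothetical sizes chosen only to make the arithmetic easy to follow.
table = costs(M=100_000_000, V=50_000, V_k=10_000, L=2048,
              d_model=768, n_local=500)

for variant, (first, second) in table.items():
    print(f"{variant}: {first:,.0f} | {second:,.1f}")
```

With these toy numbers the second expression shrinks from GLOB to TRIM to SPEC, which matches the ordering the expressions imply whenever $|V_k| < |V|$ and $|V_k| + L > 0$.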

Formula formula_4: $\eta^{\text{STD}(\tau=0)}_{\max},\ \eta^{\text{STD}(\tau=0.3)}_{\max},\ \eta^{\text{STD}(\tau=1)}_{\max}$

Formula formula_5: $(10^{-1},\ 2 \times 10^{-4},\ 7 \times 10^{4})$ - $1 \times 10^{-4}$ $1 \times 10^{-4}$ $1.5 \times 10^{-4}$

Formula formula_6: $\eta = \eta_{\max} - 0.5k \times 10^{-5}, \quad k \in \{0, 1, 2, \dots, K\},$
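The learning-rate rule above defines a linear grid that steps down from $\eta_{\max}$ in increments of $0.5 \times 10^{-5}$. A minimal sketch of enumerating that grid follows; the values `eta_max = 1e-4` and `K = 4` are hypothetical and serve only to show the pattern, and `lr_grid` is an illustrative name.

```python
def lr_grid(eta_max, K):
    """Return eta = eta_max - 0.5 * k * 1e-5 for k = 0, 1, ..., K."""
    return [eta_max - 0.5 * k * 1e-5 for k in range(K + 1)]


# Hypothetical settings: five candidate learning rates below eta_max.
grid = lr_grid(eta_max=1e-4, K=4)

for k, eta in enumerate(grid):
    print(f"k={k}: eta={eta:.2e}")
```

The grid contains K + 1 values, and K must stay small enough that the last entry remains positive.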
