TL;DR: Exploring scaling laws for upcycling dense language models to MoE, revealing key trade-offs and guidelines for efficient training.
Abstract: Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters.
There are two approaches to mitigating such computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models such as mixture-of-experts (MoE).
In this paper, we study the upcycling of LLMs to MoE models, whose scaling behavior remains underexplored.
Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration.
In particular, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training datasets that limits the efficiency of upcycling at large computational budgets.
Based on these findings, we provide guidance for scaling upcycling, and establish conditions under which upcycling outperforms from-scratch training within budget constraints.
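To make the idea of an interaction term concrete, the following is a minimal sketch and not the paper's fitted law. It assumes a hypothetical functional form in which the power-law exponent on the upcycled dataset size shrinks with the logarithm of the dense dataset size. The function names, the form itself, and every constant are illustrative assumptions; the actual formulation and coefficients are given in the paper and the linked code.

```python
import math


def effective_exponent(d_dense, alpha_up=0.3, gamma=0.01):
    """Exponent on the upcycled dataset size (hypothetical form).

    The gamma * log(d_dense) term is the assumed interaction: the more
    tokens the dense model has seen, the smaller the exponent becomes.
    """
    return alpha_up - gamma * math.log(d_dense)


def upcycled_loss(d_dense, d_upcycled, a=10.0, alpha_dense=0.05,
                  alpha_up=0.3, gamma=0.01, floor=1.5):
    """Hypothetical loss of an upcycled MoE model.

    d_dense:    tokens used to train the dense model (assumed variable)
    d_upcycled: tokens used after upcycling to MoE (assumed variable)
    All constants are made up for illustration, not fitted values.
    """
    exponent = effective_exponent(d_dense, alpha_up, gamma)
    return a * d_dense ** (-alpha_dense) * d_upcycled ** (-exponent) + floor


def reducible_ratio(d_dense, d_upcycled, factor=10.0, floor=1.5):
    """Fraction of reducible loss left after `factor` times more upcycled data."""
    before = upcycled_loss(d_dense, d_upcycled, floor=floor) - floor
    after = upcycled_loss(d_dense, factor * d_upcycled, floor=floor) - floor
    return after / before


if __name__ == "__main__":
    for d_dense in (1e9, 1e10, 1e11):
        print(f"dense tokens {d_dense:.0e}: "
              f"exponent {effective_exponent(d_dense):.4f}, "
              f"ratio after 10x upcycled data {reducible_ratio(d_dense, 1e9):.4f}")
```

Under these assumed constants, more upcycled data still lowers the loss, but each tenfold increase removes a smaller share of the reducible loss when the dense model was trained on more tokens. This is the kind of diminishing efficiency that the interaction term describes qualitatively.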
Lay Summary: Training large language models takes a lot of time and computing power. We focus on how to build larger and more efficient models that only activate parts of themselves when needed (called mixture-of-experts, or MoE), by reusing smaller ones (called upcycling), to reduce training costs.
We find patterns that explain how performance changes depending on how big the dataset is and how the model is built. We also discover a new effect: when you reuse a model and give it more data, performance does not simply keep improving but is affected by how the original model was trained.
Based on our results, we offer guidelines to get the most out of upcycling, and show when it can be better than starting training from scratch, especially when working within a limited budget.
Link To Code: https://github.com/sbintuitions/sparse-upcycling-scaling-laws
Primary Area: Deep Learning->Large Language Models
Keywords: language modeling, mixture of experts, scaling law, upcycling
Submission Number: 4498