Track: long paper (up to 4 pages)
Keywords: LLM, Mixture of Experts (MoE), Adaptive Inference, Parameter Sparsity
TL;DR: A novel post-training optimization framework that transforms a pre-trained dense LLM into a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost, enabling larger experts for complex tokens and smaller experts for simpler ones.
Abstract: Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM into a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This adaptation makes the model dynamic, with a sensitivity control that customizes the balance between efficiency and accuracy. DynaMoE features a token-difficulty-aware router that predicts the difficulty of tokens and directs them to the appropriate sub-networks or experts, enabling larger experts to handle more complex tokens and smaller experts to process simpler ones. Our experiments demonstrate that DynaMoE can generate a range of adaptive model variants with a single fine-tuning step, using only $5B$ tokens, a minimal cost compared to the base model's training. Each variant offers a distinct trade-off between accuracy and performance.
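To make the routing idea concrete, below is a minimal sketch of a token-difficulty-aware layer with large and small experts. This is not the authors' implementation: the module names (DifficultyRouter, DynaMoELayer), the sigmoid difficulty score, the hard threshold, and all dimensions are illustrative assumptions; the abstract only specifies that a router predicts token difficulty and a sensitivity setting trades accuracy for efficiency.

```python
# Illustrative sketch only (assumed design, not the paper's code): a router
# scores each token's difficulty and sends it to a wide or narrow FFN expert.
import torch
import torch.nn as nn


class DifficultyRouter(nn.Module):
    """Predicts a per-token difficulty score in [0, 1]."""

    def __init__(self, d_model: int):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> difficulty: (batch, seq_len)
        return torch.sigmoid(self.scorer(x)).squeeze(-1)


class DynaMoELayer(nn.Module):
    """Routes 'hard' tokens to a large expert and 'easy' tokens to a small one.

    `sensitivity` is the routing threshold: lowering it sends more tokens to
    the large expert (higher accuracy, higher cost); raising it does the
    opposite.
    """

    def __init__(self, d_model: int, d_large: int, d_small: int,
                 sensitivity: float = 0.5):
        super().__init__()
        self.router = DifficultyRouter(d_model)
        self.large_expert = nn.Sequential(
            nn.Linear(d_model, d_large), nn.GELU(), nn.Linear(d_large, d_model)
        )
        self.small_expert = nn.Sequential(
            nn.Linear(d_model, d_small), nn.GELU(), nn.Linear(d_small, d_model)
        )
        self.sensitivity = sensitivity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        difficulty = self.router(x)            # (batch, seq_len)
        hard = difficulty >= self.sensitivity  # boolean mask over tokens
        out = torch.empty_like(x)
        out[hard] = self.large_expert(x[hard])    # complex tokens -> large expert
        out[~hard] = self.small_expert(x[~hard])  # simple tokens -> small expert
        return out


if __name__ == "__main__":
    layer = DynaMoELayer(d_model=512, d_large=2048, d_small=512, sensitivity=0.5)
    tokens = torch.randn(2, 16, 512)
    print(layer(tokens).shape)  # torch.Size([2, 16, 512])
```

Under these assumptions, sweeping `sensitivity` at inference time yields the family of accuracy-efficiency operating points the abstract describes, without retraining.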
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Submission Number: 51