StructMoE: Augmenting MoEs with Hierarchically Routed Low Rank Experts

25 Sept 2024 (modified: 16 Oct 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: moe, mixture of experts, LLM, transformer
TL;DR: A method to scale MoE models using structured modules
Abstract: The traditional approach to scaling Mixture of Experts (MoE) for transformer models has been to increase the total number of experts. While performance improves with more experts, the gains are diminishing, whereas memory scales linearly with the number of experts. We introduce $\textit{StructMoE}$, a scaling approach for Mixture of Experts which augments experts with additional dynamic capacity using routed structured matrices, which we refer to as $\textbf{L}$ow $\textbf{R}$ank $\textbf{E}$xperts ($\textit{LoRE}$). At a high level, we introduce hierarchical MoEs where the first level of routing decides which expert each token should be routed to and the second level of routing decides which $\textit{LoRE}$ each token should be routed through. The outputs of the expert and the $\textit{LoRE}$ are then entangled together to produce the final output. This introduces more dynamism into the model, which has empirically been shown to improve performance. We find that this scaling approach outperforms a standard MoE baseline in terms of loss on a held-out validation set. We therefore propose it as a more effective scaling technique for MoEs than the standard approach of adding more experts to the model.
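The sketch below illustrates the hierarchical routing described in the abstract: a first router assigns each token to an expert, a second router assigns it to a low-rank expert ($\textit{LoRE}$), and the two outputs are combined. The abstract does not specify the routing top-k, the structure of the LoREs beyond being low rank, or how the outputs are "entangled", so this minimal PyTorch sketch assumes top-1 routing at both levels and an additive combination; all class and parameter names here are hypothetical, not from the paper.

```python
# Minimal sketch of hierarchically routed low-rank experts.
# Assumptions (not specified in the abstract): top-1 routing at both levels,
# element-wise addition as the "entangling" operation, per-token dispatch loop
# kept simple for clarity rather than efficiency.
import torch
import torch.nn as nn


class LoRE(nn.Module):
    """Low Rank Expert: a rank-bottlenecked linear path (assumed structure)."""

    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)

    def forward(self, x):
        return self.up(self.down(x))


class StructMoELayer(nn.Module):
    """Level 1 routes tokens to experts; level 2 routes tokens to LoREs."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, n_lores: int, rank: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )
        self.lores = nn.ModuleList([LoRE(d_model, rank) for _ in range(n_lores)])
        self.expert_router = nn.Linear(d_model, n_experts)
        self.lore_router = nn.Linear(d_model, n_lores)

    def forward(self, x):  # x: (n_tokens, d_model)
        expert_idx = self.expert_router(x).argmax(dim=-1)  # level-1 routing decision
        lore_idx = self.lore_router(x).argmax(dim=-1)      # level-2 routing decision
        out = torch.zeros_like(x)
        for t in range(x.size(0)):
            expert_out = self.experts[expert_idx[t].item()](x[t])
            lore_out = self.lores[lore_idx[t].item()](x[t])
            out[t] = expert_out + lore_out  # one possible way to combine the two paths
        return out


if __name__ == "__main__":
    layer = StructMoELayer(d_model=64, d_ff=256, n_experts=4, n_lores=8, rank=8)
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)  # torch.Size([10, 64])
```

Because the LoREs are low rank, adding more of them grows parameter count far more slowly than adding full experts, which is the memory argument the abstract makes for this scaling direction.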
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4551