ExpertZIP: A Progressive Fusion Framework for Mixture-of-Experts Model Optimization through Huffman Tree Structures

21 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · License: CC BY 4.0
Keywords: Large language models, Mixture-of-Experts, Expert Fusion
TL;DR: ExpertZIP is a Huffman tree-based expert fusion technique that optimizes Mixture-of-Experts (MoE) models, reducing both model size and inference time with minimal performance loss, ideal for resource-constrained and real-time applications.
Abstract: Mixture-of-Experts (MoE) models have gained attention as a novel approach to developing large language models (LLMs), praised for their ability to enhance performance by utilizing multiple experts. However, while increasing the number of experts in these models can yield performance gains, it also introduces significant trade-offs, such as substantial memory overhead and increased inference time, limiting their scalability and practical deployment. In this work, we conduct a thorough analysis of expert utilization and identify a key inefficiency: many experts are underutilized, leading to suboptimal resource allocation with limited performance benefit. To address this issue, we propose ExpertZIP, a progressive fusion framework for MoE models that leverages a Huffman tree-based expert fusion technique. This progressive approach systematically merges underutilized experts step by step, preserving their essential contributions while drastically reducing memory usage and computational demands. Our approach yields a 17.23x reduction in model size and a 4.84x improvement in inference time, with only a 1.18% decrease in average accuracy compared to the original 64-expert Switch Transformer model. Moreover, it achieves a 6.47% increase in accuracy relative to models with an equivalent number of experts. These results show that our optimized framework delivers performance on par with larger models, offering an efficient solution for resource-constrained and real-time applications.
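To illustrate the idea of Huffman tree-based expert fusion described in the abstract, the sketch below builds a priority queue over per-expert routing frequencies and repeatedly merges the two least-utilized experts until a target expert count is reached. This is a minimal, hypothetical sketch, not the authors' implementation: the function name `huffman_fuse_experts` and the utilization-weighted averaging of expert weights are illustrative assumptions about the merge step.

```python
import heapq
from dataclasses import dataclass, field

import torch


@dataclass(order=True)
class _Node:
    # Priority-queue entry: experts with the lowest routing frequency are merged first.
    freq: float
    idx: int = field(compare=False)               # identity / tie-break bookkeeping
    weight: torch.Tensor = field(compare=False)   # flattened expert parameters


def huffman_fuse_experts(expert_weights, expert_freqs, target_num_experts):
    """Progressively merge the least-utilized experts, Huffman-style (illustrative sketch).

    expert_weights: list of tensors, one per expert (all the same shape).
    expert_freqs:   routing counts / utilization per expert.
    target_num_experts: number of experts to keep after fusion.
    """
    heap = [
        _Node(freq=float(f), idx=i, weight=w.clone())
        for i, (w, f) in enumerate(zip(expert_weights, expert_freqs))
    ]
    heapq.heapify(heap)

    next_idx = len(heap)
    while len(heap) > target_num_experts:
        a = heapq.heappop(heap)  # least-utilized expert
        b = heapq.heappop(heap)  # second least-utilized expert
        total = a.freq + b.freq
        # Assumed merge rule: utilization-weighted average, so the busier expert dominates.
        fused = (a.freq * a.weight + b.freq * b.weight) / total
        heapq.heappush(heap, _Node(freq=total, idx=next_idx, weight=fused))
        next_idx += 1

    survivors = sorted(heap, key=lambda n: -n.freq)
    return [n.weight for n in survivors], [n.freq for n in survivors]


# Example: fuse 64 toy experts down to 8 using random utilization counts.
if __name__ == "__main__":
    torch.manual_seed(0)
    weights = [torch.randn(128) for _ in range(64)]
    freqs = torch.randint(1, 1000, (64,)).tolist()
    fused_w, fused_f = huffman_fuse_experts(weights, freqs, target_num_experts=8)
    print(len(fused_w), fused_f)
```

As in Huffman coding, the fused expert re-enters the queue with the combined frequency of its children, so rarely routed experts are absorbed early while heavily used experts survive largely intact; the actual paper may use a different merge operator and router-frequency estimate.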
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2376