Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts

ACL ARR 2024 December Submission608 Authors

14 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: In this work, we address the memory overhead of deploying Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs). While MoE layers improve LLM performance without increasing inference costs, the ever-growing number of experts inflates memory requirements, hindering practical deployment. Our empirical study reveals that some experts encode redundant knowledge during pre-training. We thus propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures: Mixtral, Deepseek-MoE, and Qwen. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks. We will release our code to facilitate future research.
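The abstract describes the approach only at a high level. As one possible illustration of "grouping and pruning similar experts" (a minimal sketch, not the paper's actual algorithm), the snippet below groups the experts of a single MoE layer by the cosine similarity of their flattened weights and merges each group into one representative expert. The function names, the similarity threshold, and the weight-averaging merge rule are all assumptions introduced here for illustration.

```python
# Illustrative sketch only: group experts with highly similar weights and keep
# one representative per group. Threshold and merge rule are assumptions, not
# the paper's method.
import torch
import torch.nn.functional as F


def group_similar_experts(expert_weights: list[torch.Tensor], threshold: float = 0.9):
    """Greedily group experts whose flattened weights have cosine similarity
    above `threshold`. Returns a list of index groups (one list per group)."""
    flat = torch.stack([w.flatten() for w in expert_weights])                # (E, D)
    sim = F.cosine_similarity(flat.unsqueeze(1), flat.unsqueeze(0), dim=-1)  # (E, E)
    groups, assigned = [], set()
    for i in range(len(expert_weights)):
        if i in assigned:
            continue
        group = [i] + [
            j
            for j in range(i + 1, len(expert_weights))
            if j not in assigned and sim[i, j] > threshold
        ]
        assigned.update(group)
        groups.append(group)
    return groups


def merge_expert_groups(expert_weights: list[torch.Tensor], groups: list[list[int]]):
    """Collapse each group into a single expert by averaging its weights
    (one simple merging choice; other rules are possible)."""
    return [torch.stack([expert_weights[i] for i in g]).mean(dim=0) for g in groups]
```

In practice, pruning a layer this way would also require adjusting the router: tokens originally routed to a removed expert must be redirected to that expert's group representative, and the router's output dimension reduced to the number of remaining experts.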
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: pruning
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 608