MoE-SVD: Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We present a training-free compressor for MoE LLMs that uses SVD to break experts into smaller low-rank matrices, then shares and trims them to save memory and speed up inference.
Abstract: The Mixture-of-Experts (MoE) architecture improves the scaling of Large Language Models (LLMs), but its higher parameter count and memory demands create challenges for deployment. In this paper, we present MoE-SVD, a new decomposition-based compression framework tailored for MoE LLMs that requires no extra training. By harnessing Singular Value Decomposition (SVD), MoE-SVD addresses the critical issues of decomposition collapse and matrix redundancy in MoE architectures. Specifically, we first decompose experts into compact low-rank matrices, which accelerates inference and reduces memory. In particular, we propose a selective decomposition strategy that measures sensitivity metrics based on weight singular values and activation statistics to automatically identify decomposable expert layers. Then, we share a single V-matrix across all experts and employ top-k selection for U-matrices. This low-rank matrix sharing and trimming scheme yields significant parameter reduction while preserving diversity among experts. Comprehensive experiments on Mixtral, Phi-3.5, DeepSeek, and Qwen2 MoE LLMs show that MoE-SVD outperforms other compression methods, achieving a 60% compression ratio and 1.5× faster inference with minimal performance loss. Code is available at: https://github.com/lliai/MoE-SVD.
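
The abstract's core mechanism can be illustrated with a minimal sketch, not the authors' implementation: each expert weight is factorized with truncated SVD, a single V-matrix is shared across experts, and only the top-k expert-specific U-matrices are retained. The `rank` and `keep_k` parameters, the averaging used for the shared V, and the reconstruction-error criterion for trimming are all illustrative assumptions.

```python
import torch

def decompose_experts(expert_weights, rank, keep_k):
    """expert_weights: list of (out_dim, in_dim) tensors, one per expert."""
    us, vhs = [], []
    for W in expert_weights:
        # Truncated SVD: W ≈ (U · diag(S)) @ Vh using the top-`rank` singular values.
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        us.append(U[:, :rank] * S[:rank])   # fold singular values into U
        vhs.append(Vh[:rank, :])            # per-expert V^T before sharing
    # Share a single V-matrix across all experts (simple averaging here,
    # standing in for whatever aggregation the paper actually uses).
    shared_Vh = torch.stack(vhs).mean(dim=0)
    # Trim U-matrices: keep the k experts whose shared-V reconstruction is
    # most faithful (an illustrative criterion, not the paper's).
    errs = torch.stack([torch.linalg.norm(W - U_r @ shared_Vh)
                        for W, U_r in zip(expert_weights, us)])
    keep = errs.argsort()[:keep_k].tolist()
    return shared_Vh, {i: us[i] for i in keep}

# Usage on small random stand-in expert weights:
experts = [torch.randn(512, 1024) for _ in range(8)]
shared_Vh, kept_us = decompose_experts(experts, rank=128, keep_k=4)
print(shared_Vh.shape, sorted(kept_us))
```

Sharing one V-matrix means its parameters are paid for once rather than per expert, while the retained low-rank U-matrices preserve per-expert behavior; that is the source of the parameter reduction the abstract describes.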
Lay Summary: We introduce MoE-SVD, a decomposition-based compression approach designed specifically for Mixture-of-Experts (MoE) Large Language Models (LLMs). Leveraging Singular Value Decomposition (SVD), our method reduces parameter redundancy and memory requirements without additional training. We propose selective decomposition based on sensitivity metrics, share a single V-matrix across experts, and trim U-matrices through top-k selection. Experiments on MoE models including Mixtral, Phi-3.5, DeepSeek, and Qwen2 demonstrate a 60% compression ratio and 1.5× faster inference with minimal performance degradation.
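
A hedged sketch of the kind of layer-sensitivity score described above, combining the weight's singular-value spectrum with activation statistics to decide which expert layers tolerate decomposition. The exact formula here (tail singular-value energy beyond the target rank, scaled by the mean calibration-activation norm) is an assumption, not the paper's metric.

```python
import torch

def decomposition_sensitivity(W, calib_inputs, rank):
    # Singular-value tail energy: how much of W would be discarded at `rank`.
    S = torch.linalg.svdvals(W)
    tail_energy = S[rank:].square().sum() / S.square().sum()
    # Activation statistic from a small calibration batch (shape: [tokens, in_dim]).
    act_scale = calib_inputs.norm(dim=-1).mean()
    return (tail_energy * act_scale).item()   # lower score => safer to decompose

# Layers scoring below a chosen threshold would be decomposed; others stay full-rank.
W = torch.randn(512, 1024)
x = torch.randn(64, 1024)
print(decomposition_sensitivity(W, x, rank=128))
```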
Primary Area: Deep Learning->Large Language Models
Keywords: Mixture of Experts, Efficient Large Language Models, Low-Rank Decomposition, Network Sparsity
Submission Number: 95