MoDE: Weight Denoising Towards Better LLM Performance through a Mixture of Domain Experts

Yuchen Xian; Yixuan Han; Fan Ma; Yi Yang

16 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · Readers: Everyone · CC BY 4.0

Keywords: Mixture of Experts, Large Language Models

TL;DR: Pruning noise weights boosts in-domain accuracy; MoDE ensures cross-domain generalization.

Abstract: Pruning large language models (LLMs) is widely assumed to degrade performance, because most weights are considered essential to model capacity; existing methods therefore rely primarily on training to retain accuracy. Our findings show, however, that weight importance is domain-dependent rather than globally consistent, which reveals noise weights whose removal can improve domain-specific performance. We first present the DENoise (Domain Expert weight deNoising) algorithm, which removes domain-aware noise weights and improves performance without fine-tuning. We then develop MoDE (Mixture of Domain Experts), which treats these in-domain optimal denoised models as experts and uses a bilevel trainable router to activate them dynamically, thereby enhancing out-of-domain generalization. Experiments show that DENoise yields gains of 2–3% on benchmarks such as MMLU, MBPP, and GSM8K, while MoDE achieves an average improvement of over 1.1% over baseline models, all without introducing additional parameters or tuning overhead.
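
To make the central claim concrete (that removing some weights can raise in-domain accuracy without any fine-tuning), the toy sketch below may help. It is not the DENoise algorithm, whose selection criterion the abstract does not specify. It assumes a simple greedy rule: zero one weight at a time and keep the removal only if a domain validation score strictly improves. The linear model, the `domain_score` metric, the function names, and the data are all hypothetical stand-ins.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]  # (input features, target)


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    """Toy linear model standing in for an LLM (assumption for illustration)."""
    return sum(w * xi for w, xi in zip(weights, x))


def domain_score(weights: Sequence[float], data: Sequence[Example]) -> float:
    """Negative mean squared error on one domain's validation set; higher is better."""
    return -sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)


def greedy_denoise(
    weights: Sequence[float], data: Sequence[Example]
) -> Tuple[List[float], List[int]]:
    """Zero weights one at a time, keeping a removal only if the domain score
    strictly improves. No gradient updates are performed."""
    pruned = list(weights)
    best = domain_score(pruned, data)
    removed: List[int] = []
    for i in range(len(pruned)):
        if pruned[i] == 0.0:
            continue  # already removed
        saved = pruned[i]
        pruned[i] = 0.0
        score = domain_score(pruned, data)
        if score > best:
            best = score          # removal helped on this domain: keep it
            removed.append(i)
        else:
            pruned[i] = saved     # removal hurt or changed nothing: restore
    return pruned, removed


# Hypothetical domain where the target depends only on the first two features,
# so the third weight acts as "noise" for this domain.
weights = [2.0, -1.0, 0.5]
data = [
    ([1.0, 0.0, 1.0], 2.0),
    ([0.0, 1.0, -1.0], -1.0),
    ([1.0, 1.0, 2.0], 1.0),
]

pruned, removed = greedy_denoise(weights, data)
print("removed indices:", removed)
print("score before:", domain_score(weights, data))
print("score after:", domain_score(pruned, data))
```

In this toy setting only the third weight is removed, and the domain score improves as a result. A different domain's validation set could select a different mask from the same starting weights, which is the sense in which weight importance is described as domain-dependent.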

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 7159
