Keywords: Mixture of Experts, Large Language Models
TL;DR: Pruning noise weights boosts in-domain accuracy; MoDE ensures cross-domain generalization.
Abstract: Pruning in large language models (LLMs) is widely assumed to degrade performance, since most weights are considered essential contributors to model capacity; existing methods therefore rely primarily on training to retain accuracy. However, our findings show that weight importance is domain-dependent rather than globally consistent, revealing the existence of noise weights whose removal can enhance domain-specific performance. To this end, we first present the DENoise (Domain Expert weight deNoising) algorithm, which removes domain-aware noise weights and improves in-domain performance without requiring fine-tuning. We further develop MoDE (Mixture of Domain Experts), which treats these in-domain optimal denoised models as experts and employs a bilevel trainable router to dynamically activate them, thereby enhancing out-of-domain generalization. Experimental results show that applying the DENoise algorithm yields 2-3% gains across benchmarks such as MMLU, MBPP, and GSM8K, while MoDE achieves an average improvement of over 1.1% against baseline models, all without introducing additional parameters or tuning overhead.
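The abstract does not say how DENoise identifies noise weights, so the following is only a toy sketch of the general idea that removing a weight can raise in-domain accuracy with no fine-tuning. It assumes a greedy rule (zero one weight at a time and keep the removal only if a domain validation score strictly improves) applied to a tiny linear classifier. The names `denoise_sketch` and `domain_accuracy` are hypothetical and are not taken from the paper, and the MoDE router is not covered.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]


def domain_accuracy(examples: Sequence[Example]) -> Callable[[List[float]], float]:
    """Build a scorer: accuracy of a linear classifier on one domain's examples."""
    def score(weights: List[float]) -> float:
        correct = 0
        for features, label in examples:
            logit = sum(w * x for w, x in zip(weights, features))
            prediction = 1 if logit > 0 else 0
            correct += int(prediction == label)
        return correct / len(examples)
    return score


def denoise_sketch(
    weights: Sequence[float],
    score: Callable[[List[float]], float],
) -> Tuple[List[float], List[int]]:
    """Greedy, training-free pruning (an assumed rule, not the paper's algorithm).

    Zero each weight in turn and keep the removal only when the domain score
    strictly improves; otherwise restore the weight.
    """
    current = list(weights)
    best = score(current)
    removed: List[int] = []
    for i in range(len(current)):
        if current[i] == 0.0:
            continue
        saved = current[i]
        current[i] = 0.0
        trial = score(current)
        if trial > best:
            best = trial          # removal helped: treat this weight as noise
            removed.append(i)
        else:
            current[i] = saved    # removal did not help: restore the weight
    return current, removed


if __name__ == "__main__":
    # Toy domain: the label depends only on the sign of feature 0,
    # while the weight on feature 1 hurts accuracy in this domain.
    examples = [
        ([1.0, 1.0, 0.0], 1),
        ([1.0, 0.0, 0.0], 1),
        ([-1.0, -1.0, 0.0], 0),
        ([-1.0, 0.0, 1.0], 0),
    ]
    scorer = domain_accuracy(examples)
    pruned, removed = denoise_sketch([1.0, -2.0, 0.5], scorer)
    print(pruned, removed, scorer(pruned))
```

In this toy domain the weight on feature 1 acts as noise: zeroing it raises accuracy from 0.5 to 1.0, while the other two weights are kept because removing them does not improve the score. A different domain's examples could single out a different weight, which is the sense in which importance is domain-dependent.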
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7159