Abstract: Mixture of Experts (MoE) has great potential for scaling up the capacity of large models while maintaining low computational costs. Recent works have focused on reducing expert-level redundancy by designing various token allocation strategies within gating functions. However, the intricate internal relationships between experts cause knowledge redundancy at the fine-grained neuron level, and research on collaboration among experts remains scarce.
In this paper, we propose an Information Bottleneck based MoE (IBMoE) for parameter-efficient fine-tuning, which reduces neuron-level redundancy within each expert and fosters internal collaboration among all experts. Specifically, a sparse neuronal activation strategy is introduced to dynamically activate relevant neurons and reduce redundancy when processing different tasks. In addition, a diversity constraint is imposed among experts, which maximizes the knowledge differences between them and enables all experts to cooperate more efficiently.
Extensive experiments demonstrate the advantages of our method: it achieves superior performance while reducing inference time by 63\% and memory consumption by 48.5\% compared to recent baselines. Our code will be made publicly available.
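To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it illustrates one plausible reading of "sparse neuronal activation" as a per-token top-k mask over an expert's hidden neurons, and of the "diversity constraint" as a pairwise cosine-similarity penalty over expert outputs. All names (SparseExpert, diversity_loss, k) and design details are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseExpert(nn.Module):
    """Feed-forward expert that keeps only the k most relevant hidden neurons
    per token (illustrative stand-in for the sparse activation strategy)."""
    def __init__(self, d_model: int, d_hidden: int, k: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.gelu(self.up(x))                          # hidden activations
        # Keep the k largest-magnitude activations per token; zero the rest.
        topk = h.abs().topk(self.k, dim=-1)
        mask = torch.zeros_like(h).scatter_(-1, topk.indices, 1.0)
        return self.down(h * mask)

def diversity_loss(expert_outputs: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity between expert output representations
    (illustrative stand-in for the diversity constraint; lower = more diverse).
    expert_outputs: (num_experts, batch, d_model)."""
    e = F.normalize(expert_outputs.flatten(1), dim=-1)  # (E, batch * d_model)
    sim = e @ e.t()                                     # (E, E) cosine similarities
    off_diag = sim - torch.diag_embed(torch.diagonal(sim))
    num_experts = e.size(0)
    return off_diag.sum() / (num_experts * (num_experts - 1))
```

In training, the diversity term would be added to the task loss with a weighting coefficient so that experts are pushed toward complementary rather than overlapping knowledge; the actual objective and masking mechanism used by IBMoE are specified in the paper itself.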
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: representation learning
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 6966