Abstract: Mixture-of-Experts (MoE) models keep computational cost low despite their large parameter counts.
However, unbalanced expert selection during routing leads to inefficient use of those parameters.
An auxiliary loss is therefore commonly used to make expert selection uniform, but it has been found to interfere with language-model performance.
In this study, we propose a supervised learning approach to MoE routing that uses token frequencies as the supervision signal.
This method aims to align expert selection with the knowledge each expert has acquired.
As a case study, we focus on domain adaptation to the legal domain.
Without the auxiliary loss, the proposed method achieved results comparable to the baseline trained with the auxiliary loss.
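To make the two routing objectives concrete, here is a minimal numpy sketch of a Switch-Transformer-style top-1 router with the standard load-balancing auxiliary loss, alongside a hypothetical supervised routing loss driven by per-token expert labels (standing in for the paper's token-frequency signal; the labels and shapes here are illustrative assumptions, not the authors' exact formulation).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the expert dimension.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts = 8, 4, 2

tokens = rng.normal(size=(num_tokens, d_model))
w_router = rng.normal(size=(d_model, num_experts))

# Router probabilities and top-1 expert assignment per token.
probs = softmax(tokens @ w_router)          # shape (tokens, experts)
assign = probs.argmax(axis=-1)              # chosen expert per token

# Switch-style auxiliary load-balancing loss: N * sum_i f_i * P_i,
# where f_i is the fraction of tokens routed to expert i and P_i is
# the mean router probability mass on expert i. Minimized (value 1)
# when routing is perfectly uniform.
f = np.bincount(assign, minlength=num_experts) / num_tokens
P = probs.mean(axis=0)
aux_loss = num_experts * float((f * P).sum())

# Hypothetical supervised alternative: cross-entropy against expert
# labels derived from token statistics (placeholder random labels here;
# the paper derives its signal from corpus token frequencies).
labels = (rng.random(num_tokens) < 0.5).astype(int)
supervised_loss = float(-np.log(probs[np.arange(num_tokens), labels]).mean())
```

The point of the contrast: the auxiliary loss pushes the router toward uniform expert usage regardless of token content, while the supervised loss ties each token's routing target to a property of the token itself.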
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: continual learning, fine-tuning
Contribution Types: NLP engineering experiment
Languages Studied: Japanese
Submission Number: 1010