Abstract: Mixture of Experts (MoE) is a well-established technique in machine learning and is widely used to scale large language models. Unfortunately, training the experts is resource-intensive. To relax this requirement, we propose a modification to the architecture of pretrained LLMs that we call self Mixture of Experts (self-MoE): a mixture of experts in which every expert is the exact same model. This adjustment adds only a handful of weights yet yields a significant improvement in model performance. We evaluate self-MoE on two main tracks, mathematical reasoning and code generation, and observe consistent gains across various benchmarks. We will publish the training code and the model weights upon acceptance.
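As a purely illustrative sketch of the idea stated in the abstract (an MoE-style block in which every expert reuses the same pretrained weights, so only a handful of new parameters are introduced), the hypothetical PyTorch module below gates over several paths through one shared feed-forward block. The per-path scale vectors, the gating layout, and all names here are assumptions for illustration; the paper's actual mechanism is not specified in this abstract.

```python
import torch
import torch.nn as nn


class SelfMoESketch(nn.Module):
    """Hypothetical sketch: K 'experts' that all reuse one pretrained FFN.

    Only the gate and the tiny per-path scale vectors are newly added weights;
    the shared FFN itself is the pretrained block, used unchanged on every path.
    """

    def __init__(self, shared_ffn: nn.Module, hidden_dim: int, num_paths: int = 4):
        super().__init__()
        self.shared_ffn = shared_ffn
        self.num_paths = num_paths
        # Newly added parameters: a small router ...
        self.gate = nn.Linear(hidden_dim, num_paths)
        # ... and a per-path input modulation so identical experts can still
        # behave differently on each path (an assumption, not from the paper).
        self.path_scale = nn.Parameter(torch.ones(num_paths, hidden_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden)
        weights = torch.softmax(self.gate(x), dim=-1)              # (B, T, K)
        outs = torch.stack(
            [self.shared_ffn(x * self.path_scale[k]) for k in range(self.num_paths)],
            dim=-1,
        )                                                          # (B, T, H, K)
        return (outs * weights.unsqueeze(-2)).sum(dim=-1)          # (B, T, H)


# Example usage with a toy shared feed-forward block standing in for a
# pretrained LLM sub-module (dimensions are arbitrary).
shared = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
layer = SelfMoESketch(shared, hidden_dim=512, num_paths=4)
y = layer(torch.randn(2, 16, 512))  # -> shape (2, 16, 512)
```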
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: generalization, mathematical NLP, code generation and understanding
Contribution Types: NLP engineering experiment, Approaches to low-compute settings - efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 1510