Abstract: Mixture of Experts (MoE) is a well-established technique in machine learning and is widely used to scale large language models. Unfortunately, training the experts is resource-intensive. To relax this requirement, we propose a modification to the architecture of pretrained LLMs that we call self Mixture of Experts (self-MoE): a mixture of experts in which every expert is the exact same model. This adjustment adds only a handful of weights yet yields a significant improvement in model performance. We evaluate self-MoE on two main tracks, mathematical reasoning and code generation, and observe consistent gains across various benchmarks. We will publish the training code and the model weights upon acceptance.
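As a purely illustrative sketch of the idea stated in the abstract (an MoE-style block in which every expert reuses the same pretrained weights, so only a handful of new parameters are introduced), the hypothetical PyTorch module below gates over several paths through one shared feed-forward block. The per-path scale vectors, the gating layout, and all names here are assumptions for illustration; the paper's actual mechanism is not specified in this abstract.

```python
import torch
import torch.nn as nn


class SelfMoESketch(nn.Module):
    """Hypothetical sketch: K 'experts' that all reuse one pretrained FFN.

    Only the gate and the tiny per-path scale vectors are newly added weights;
    the shared FFN itself is the pretrained block, used unchanged on every path.
    """

    def __init__(self, shared_ffn: nn.Module, hidden_dim: int, num_paths: int = 4):
        super().__init__()
        self.shared_ffn = shared_ffn
        self.num_paths = num_paths
        # Newly added parameters: a small router ...
        self.gate = nn.Linear(hidden_dim, num_paths)
        # ... and a per-path input modulation so identical experts can still
        # behave differently on each path (an assumption, not from the paper).
        self.path_scale = nn.Parameter(torch.ones(num_paths, hidden_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden)
        weights = torch.softmax(self.gate(x), dim=-1)              # (B, T, K)
        outs = torch.stack(
            [self.shared_ffn(x * self.path_scale[k]) for k in range(self.num_paths)],
            dim=-1,
        )                                                          # (B, T, H, K)
        return (outs * weights.unsqueeze(-2)).sum(dim=-1)          # (B, T, H)


# Example usage with a toy shared feed-forward block standing in for a
# pretrained LLM sub-module (dimensions are arbitrary).
shared = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
layer = SelfMoESketch(shared, hidden_dim=512, num_paths=4)
y = layer(torch.randn(2, 16, 512))  # -> shape (2, 16, 512)
```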
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: generalization, mathematical NLP, code generation and understanding
Contribution Types: NLP engineering experiment, Approaches to low-compute settings - efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 1510