MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks

Published: 25 Sept 2024, Last Modified: 06 Nov 2024 · NeurIPS 2024 poster · CC BY 4.0
Keywords: deep learning, computer vision, mixture of experts, weight initialization, fine-tuning
TL;DR: We introduce MoE Jetpack, a framework that effectively converts existing pre-trained dense checkpoints into sparsely activated MoE models, substantially improving convergence speed and accuracy for vision tasks.
Abstract: The sparsely activated mixture of experts (MoE) model presents an effective alternative to densely activated (dense) models, combining improved accuracy with computational efficiency. However, training MoE models from scratch requires extensive data and computational resources, a challenge that limits their widespread adoption. To address this, we introduce MoE Jetpack, a framework designed to fine-tune abundant, easily accessible dense checkpoints into MoE models. MoE Jetpack incorporates two key techniques: (1) **checkpoint recycling**, which initializes MoE models with dense checkpoints to accelerate convergence and enhance accuracy, minimizing the need for extensive pre-training; (2) the **hyperspherical adaptive MoE (SpheroMoE) layer**, which optimizes the MoE architecture to enhance fine-tuning performance and efficiency. Experimental results indicate that MoE Jetpack doubles the convergence speed and enhances accuracy by 2.8% on ImageNet-1K. On smaller datasets, it achieves up to 8-fold faster convergence and over 30% accuracy gains, highlighting its efficiency. The code is available at https://github.com/Adlith/MoE-Jetpack.
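To make the checkpoint-recycling idea concrete, below is a minimal, hypothetical PyTorch sketch: each expert in a toy MoE feed-forward block is initialized from a pre-trained dense FFN rather than from random weights, so fine-tuning starts from the dense solution. The names (`DenseFFN`, `SimpleMoEFFN`, `recycle_dense_ffn`) and the top-1 router are illustrative assumptions and are not taken from the MoE Jetpack repository; the authors' actual recycling scheme and SpheroMoE layer may differ substantially.

```python
# Illustrative sketch only: initialize MoE experts from a dense checkpoint.
import torch
import torch.nn as nn


class DenseFFN(nn.Module):
    """A standard transformer feed-forward block, as found in a dense ViT."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))


class SimpleMoEFFN(nn.Module):
    """A toy MoE layer: a linear router plus several FFN experts (top-1 routing)."""
    def __init__(self, dim: int, hidden_dim: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            DenseFFN(dim, hidden_dim) for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim); route each token to its highest-scoring expert.
        scores = self.router(x).softmax(dim=-1)   # (tokens, num_experts)
        top_idx = scores.argmax(dim=-1)           # (tokens,)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = expert(x[mask]) * scores[mask, e].unsqueeze(-1)
        return out


def recycle_dense_ffn(dense_ffn: DenseFFN, moe_ffn: SimpleMoEFFN) -> None:
    """Copy the dense FFN weights into every expert, so fine-tuning starts
    from the pre-trained dense solution instead of random initialization."""
    for expert in moe_ffn.experts:
        expert.load_state_dict(dense_ffn.state_dict())


# Usage: build an MoE block whose experts all start from the dense weights.
dense = DenseFFN(dim=768, hidden_dim=3072)  # weights would be loaded from a checkpoint
moe = SimpleMoEFFN(dim=768, hidden_dim=3072, num_experts=4)
recycle_dense_ffn(dense, moe)
```

In this sketch every expert begins identical to the dense FFN and diverges only through fine-tuning; the paper's contribution lies in how the dense weights are recycled and how the MoE layer is adapted (SpheroMoE), which this toy example does not attempt to reproduce.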
Primary Area: Machine vision
Submission Number: 6943