Keywords: inference efficiency, sparsity, sparsification, mixture of experts, moefication, conditional computation, dynamic neural network, modularity, large language models
TL;DR: We introduce ReLU Modulation, a method that smoothly and differentiably converts dense modules into sparsely computed MoEs while integrating clustering directly into training.
Abstract: Large language models demand substantial computational resources for training and inference. Leveraging contextual sparsity to convert dense modules into sparsely computed Mixture of Experts (MoE) offers a promising solution, but existing methods face challenges in effectively partitioning modules and handling abrupt, non-differentiable changes during conversion. We introduce ReMod (ReLU Modulation), which creates sparsity smoothly and differentiably while integrating clustering directly into training. Our method trains a small ReLU-gated modulator that scales hidden states to sparsify computation, then clusters the modulator weights to create structured sparsity with efficient hardware utilization. When applied to the MLPs and attention projections of BERT-base, ReMod reduces inference FLOPs by up to 93% while maintaining comparable accuracy, significantly outperforming previous approaches.
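To make the core mechanism concrete, the sketch below shows one plausible reading of the abstract's "small ReLU-gated modulator that scales hidden states": a low-rank ReLU network whose non-negative outputs gate an MLP's hidden activations, so units gated to exactly zero can be skipped at inference. This is a minimal illustration under assumed names and dimensions (ReLUModulator, ModulatedMLP, bottleneck_dim), not the authors' implementation, and it omits the clustering step that groups modulator weights into experts.

```python
# Minimal sketch (not the authors' code) of a ReLU-gated modulator scaling
# the hidden states of a transformer-style MLP. All module names and sizes
# are illustrative assumptions based on the abstract.
import torch
import torch.nn as nn


class ReLUModulator(nn.Module):
    """Small low-rank network producing non-negative per-unit scales via ReLU."""

    def __init__(self, d_model: int, d_hidden: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, d_hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The final ReLU yields exact zeros, so gated-off hidden units
        # contribute nothing and can be skipped at inference time.
        return torch.relu(self.up(torch.relu(self.down(x))))


class ModulatedMLP(nn.Module):
    """MLP block whose hidden activations are scaled by the modulator's gates."""

    def __init__(self, d_model: int = 768, d_hidden: int = 3072):
        super().__init__()
        self.fc_in = nn.Linear(d_model, d_hidden)
        self.fc_out = nn.Linear(d_hidden, d_model)
        self.act = nn.GELU()
        self.modulator = ReLUModulator(d_model, d_hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.modulator(x)               # (batch, seq, d_hidden), many zeros
        h = self.act(self.fc_in(x)) * gate     # zeroed units need not be computed
        return self.fc_out(h)


if __name__ == "__main__":
    mlp = ModulatedMLP()
    y = mlp(torch.randn(2, 8, 768))
    print(y.shape)  # torch.Size([2, 8, 768])
```

In this reading, clustering the rows of the modulator's output projection would group hidden units that tend to activate together into experts, turning the element-wise sparsity above into the structured, MoE-style sparsity the abstract describes.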
Submission Number: 51