ReMod: Learning Structured Sparsity with ReLU Modulation

Published: 06 Mar 2025, Last Modified: 05 Apr 2025
Venue: MCDC @ ICLR 2025
License: CC BY 4.0
Keywords: inference efficiency, sparsity, sparsification, mixture of experts, moefication, conditional computation, dynamic neural network, modularity, large language models
TL;DR: We introduce ReLU Modulation, a method that can convert dense modules into selectively computed MoEs smoothly and differentiably while integrating clustering directly into training.
Abstract: Large language models demand substantial computational resources for training and inference. Leveraging contextual sparsity to convert dense modules into sparsely computed Mixture of Experts (MoE) offers a promising solution, but existing methods face challenges in effectively partitioning modules and handling abrupt, non-differentiable changes during conversion. We introduce ReMod (ReLU Modulation), which creates sparsity smoothly and differentiably while integrating clustering directly into training. Our method trains a small ReLU-gated modulator that scales hidden states to sparsify computation, then clusters the modulator weights to create structured sparsity with optimized hardware utilization. When applied to the MLPs and attention projections of BERT-base, ReMod reduces inference FLOPs by up to 93% while maintaining comparable accuracy, significantly outperforming previous approaches.
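The core mechanism described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names (ReLUModulator, ModulatedMLP), the low-rank gate, the dimensions, and the exact point where the gate scales the hidden states are all assumptions inferred from the abstract; the subsequent clustering of modulator weights into structured expert blocks is not shown.

```python
# Minimal sketch (assumed, not the paper's code) of a ReLU-gated modulator
# that scales an MLP's hidden states to induce sparsity differentiably.
import torch
import torch.nn as nn


class ReLUModulator(nn.Module):
    """Small gate producing a non-negative, sparse scale per hidden unit."""

    def __init__(self, d_model: int, d_hidden: int, d_gate: int = 64):
        super().__init__()
        # A low-rank gate keeps the modulator cheap relative to the dense MLP.
        self.gate = nn.Sequential(
            nn.Linear(d_model, d_gate),
            nn.ReLU(),
            nn.Linear(d_gate, d_hidden),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The final ReLU drives many scales exactly to zero, so the
        # corresponding hidden units can be skipped at inference time.
        return torch.relu(self.gate(x))


class ModulatedMLP(nn.Module):
    """Dense transformer MLP whose hidden states are scaled by the modulator."""

    def __init__(self, d_model: int = 768, d_hidden: int = 3072):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        self.act = nn.GELU()
        self.modulator = ReLUModulator(d_model, d_hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.up(x))
        m = self.modulator(x)      # sparse, differentiable scaling
        return self.down(h * m)    # zeroed units contribute nothing downstream


if __name__ == "__main__":
    x = torch.randn(2, 16, 768)    # (batch, seq_len, d_model)
    mlp = ModulatedMLP()
    y = mlp(x)
    m = mlp.modulator(x)
    print(y.shape, f"gated-off fraction: {(m == 0).float().mean():.2f}")
```

In this sketch, training the modulator end to end lets sparsity emerge smoothly; the abstract then describes clustering the modulator weights so that units which activate together form contiguous blocks, giving MoE-style structured sparsity that hardware can exploit.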
Submission Number: 51
