Keywords: inference efficiency, sparsity, sparsification, mixture of experts, moefication, conditional computation, dynamic neural network, modularity, large language models
TL;DR: We introduce ReLU Modulation, a method that smoothly and differentiably converts dense modules into sparsely computed MoEs while integrating clustering directly into training.
Abstract: Large language models demand substantial computational resources for training and inference. Leveraging contextual sparsity to convert dense modules into sparsely computed Mixture of Experts (MoE) offers a promising solution, but existing methods face challenges in effectively partitioning modules and handling abrupt, non-differentiable changes during conversion. We introduce ReMod (ReLU Modulation), which creates sparsity smoothly and differentiably while integrating clustering directly into training. Our method trains a small ReLU-gated modulator that scales hidden states to sparsify computation, then clusters the modulator weights to create structured sparsity with efficient hardware utilization. When applied to the MLPs and attention projections of BERT-base, ReMod reduces inference FLOPs by up to 93% while maintaining comparable accuracy, significantly outperforming previous approaches.
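To make the core mechanism concrete, the sketch below shows one plausible reading of the abstract's "small ReLU-gated modulator that scales hidden states": a low-rank ReLU network whose non-negative outputs gate an MLP's hidden activations, so units gated to exactly zero can be skipped at inference. This is a minimal illustration under assumed names and dimensions (ReLUModulator, ModulatedMLP, bottleneck_dim), not the authors' implementation, and it omits the clustering step that groups modulator weights into experts.

```python
# Minimal sketch (not the authors' code) of a ReLU-gated modulator scaling
# the hidden states of a transformer-style MLP. All module names and sizes
# are illustrative assumptions based on the abstract.
import torch
import torch.nn as nn


class ReLUModulator(nn.Module):
    """Small low-rank network producing non-negative per-unit scales via ReLU."""

    def __init__(self, d_model: int, d_hidden: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, d_hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The final ReLU yields exact zeros, so gated-off hidden units
        # contribute nothing and can be skipped at inference time.
        return torch.relu(self.up(torch.relu(self.down(x))))


class ModulatedMLP(nn.Module):
    """MLP block whose hidden activations are scaled by the modulator's gates."""

    def __init__(self, d_model: int = 768, d_hidden: int = 3072):
        super().__init__()
        self.fc_in = nn.Linear(d_model, d_hidden)
        self.fc_out = nn.Linear(d_hidden, d_model)
        self.act = nn.GELU()
        self.modulator = ReLUModulator(d_model, d_hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.modulator(x)               # (batch, seq, d_hidden), many zeros
        h = self.act(self.fc_in(x)) * gate     # zeroed units need not be computed
        return self.fc_out(h)


if __name__ == "__main__":
    mlp = ModulatedMLP()
    y = mlp(torch.randn(2, 8, 768))
    print(y.shape)  # torch.Size([2, 8, 768])
```

In this reading, clustering the rows of the modulator's output projection would group hidden units that tend to activate together into experts, turning the element-wise sparsity above into the structured, MoE-style sparsity the abstract describes.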
Submission Number: 51