Doubly Sparse: Sparse Mixture of Sparse Experts for Efficient Softmax Inference

Shun Liao; Ting Chen; Tian Lin; Chong Wang; Dengyong Zhou

27 Sept 2018 (modified: 22 Jun 2025) | ICLR 2019 Conference Blind Submission | Readers: Everyone

Abstract: Computing the softmax function in neural network models is expensive when the number of output classes is large. This can become a significant issue in both training and inference for such models. In this paper, we present Doubly Sparse Softmax (DS-Softmax), a sparse mixture of sparse experts, to improve the efficiency of softmax inference. During training, our method learns a two-level class hierarchy by dividing the entire output class space into several partially overlapping experts. Each expert is responsible for a learned subset of the output class space, and each output class belongs to only a small number of those experts. During inference, our method quickly locates the most probable expert and computes a small-scale softmax over that expert's classes. Our method is learning-based and requires no prior knowledge of how the output class space should be partitioned. We empirically evaluate our method on several real-world tasks and demonstrate that we can achieve significant computation reductions without loss of performance.
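The inference step described in the abstract (pick the most probable expert, then run a softmax over only that expert's classes) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the dictionary layout of each expert, the top-1 gating by dot product, and the toy weights are hypothetical and are not taken from the paper. The sketch does not cover training or how the class subsets are learned, and it may differ from the paper's exact gating formulation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ds_softmax_infer(h, gate_weights, experts):
    """Two-level inference sketch (hypothetical layout, not the paper's code).

    h            : hidden vector from the network.
    gate_weights : one gating vector per expert (first sparsity level).
    experts      : list of dicts, each mapping a retained class id to its
                   weight vector (second sparsity level: each expert keeps
                   only a subset of classes; subsets may overlap).
    """
    # Level 1: score every expert and keep only the most probable one.
    gate_scores = [dot(w, h) for w in gate_weights]
    k = max(range(len(gate_scores)), key=gate_scores.__getitem__)

    # Level 2: softmax restricted to the classes retained by expert k.
    class_ids = sorted(experts[k])
    logits = [dot(experts[k][c], h) for c in class_ids]
    probs = softmax(logits)
    return k, dict(zip(class_ids, probs))


# Toy setup: 4 classes, 2 experts, class 2 belongs to both experts.
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    {0: [2.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 0.0]},
    {2: [0.0, 1.0], 3: [0.0, 2.0]},
]
expert_id, probs = ds_softmax_infer([0.2, 1.5], gate_weights, experts)
```

In this toy setup the cost of the final softmax scales with the number of classes retained by the selected expert rather than with the full class count, which is the source of the inference savings the abstract describes.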

Keywords: hierarchical softmax, model compression

TL;DR: We present doubly sparse softmax, a sparse mixture of sparse experts, which improves the efficiency of softmax inference by exploiting a two-level overlapping class hierarchy.

Data: [WikiText-2](https://paperswithcode.com/dataset/wikitext-2)

Community Implementations: [2 code implementations (CatalyzeX)](https://www.catalyzex.com/paper/doubly-sparse-sparse-mixture-of-sparse/code)
