Towards A Unified View of Sparse Feed-Forward Network in Transformer

Zeyu Liu; Tim Dettmers; Xi Victoria Lin; Veselin Stoyanov; Xian Li

22 Sept 2022 (modified: 22 Jun 2025) | ICLR 2023 Conference Withdrawn Submission | Readers: Everyone

Keywords: Mixture of Expert, Neural Memory, Pre-trained Language Model, NLP

Abstract: Large and sparse feed-forward networks (S-FFN) such as Mixture-of-Experts (MoE) have proven to be an efficient approach for scaling up Transformer model size for pretraining. By activating only part of the FFN parameters conditioned on the input, S-FFN improves generalization performance while keeping training and inference cost (in FLOPs) fixed. A growing body of work has focused on improving the S-FFN design, including routing and load-balancing methods in the context of MoEs. A separate, earlier line of work is motivated by a neural memory perspective and develops sparse neural memory techniques for S-FFN. This work merges the two seemingly different lines of work. We present a unified framework that categorizes design choices along two axes: memory block size and memory block selection method. Using this unified framework, we compare several S-FFN architectures for language modeling and provide insights into their relative efficacy and efficiency. We show that a smaller memory block size leads to lower perplexity. Additionally, we find that selection through a gate generally improves the perplexity-FLOPs trade-off but yields worse perplexity than selection using hidden states without a gate. Based on these insights, we propose a new selection method, Avg-K, that selects blocks through their mean aggregated hidden states. With 1% additional FLOPs, Avg-K achieves 2.16 lower perplexity than a vanilla transformer (16.96), outperforming Switch Transformer (16.45).
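To make the two axes concrete, here is a minimal pure-Python sketch of one possible reading of block selection by mean-aggregated hidden states. The abstract does not give the exact formulation, so everything below is an assumption for illustration: the function name `avg_k_ffn`, the parameters `block_size` (memory block size) and `top_b` (number of blocks kept), the ReLU activation, and the choice to score each block by the dot product between the input and the mean of that block's key vectors.

```python
def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def avg_k_ffn(x, keys, values, block_size, top_b):
    """Sparse FFN forward pass for a single token (illustrative sketch only).

    x:       input vector of length d
    keys:    list of memory-cell key vectors, each of length d
    values:  list of memory-cell value vectors, each of length d
    Memory cells are grouped into contiguous blocks of `block_size` cells.
    Returns (output vector, sorted list of selected block indices).
    """
    n_blocks = len(keys) // block_size
    d = len(x)

    # Assumed selection rule: score each block by x . mean(keys in block).
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / block_size for col in zip(*block)]
        scores.append(dot(x, mean_key))

    # Keep only the top_b highest-scoring blocks.
    ranked = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)
    chosen = sorted(ranked[:top_b])

    # Evaluate the ordinary FFN, restricted to cells in the chosen blocks.
    out = [0.0] * d
    for b in chosen:
        for i in range(b * block_size, (b + 1) * block_size):
            h = max(0.0, dot(x, keys[i]))  # ReLU hidden activation
            for j in range(d):
                out[j] += h * values[i][j]
    return out, chosen


# Toy usage: 4 memory cells, 2 blocks of size 2, keep 1 block.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
out, chosen = avg_k_ffn([1.0, 0.0], keys, values, block_size=2, top_b=1)
print(out, chosen)
```

In this sketch, setting `top_b` to the total number of blocks recovers the dense FFN, while shrinking `block_size` moves along the block-size axis of the framework. A gate-based selector would instead score blocks with a separately learned routing matrix rather than with the FFN's own keys.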

Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)

TL;DR: We present a unified framework for large and sparse feed-forward networks in Transformers, and use it to arrive at a better method.
