Keywords: Mixture of Experts, Neural Memory, Pre-trained Language Model, NLP
Abstract: Large and sparse feed-forward networks (S-FFN) such as Mixture-of-Experts (MoE) have proven to be an efficient approach for scaling up Transformer model size for pretraining. By activating only a subset of the FFN parameters conditioned on the input, S-FFN improves generalization performance while keeping training and inference cost (in FLOPs) fixed. A growing body of work has focused on improving the S-FFN design, including routing and load-balancing methods in the context of MoEs. A separate, earlier line of work is motivated by a neural memory perspective and develops sparse neural memory techniques for S-FFN. This work merges these two seemingly different lines of work. We present a unified framework that categorizes design choices along two axes: memory block size and memory block selection method. Using this unified framework, we compare several S-FFN architectures for language modeling and provide insights into their relative efficacy and efficiency. We show that a smaller memory block size leads to lower perplexity. Additionally, we find that selection through a gate, in general, improves the perplexity-FLOPs trade-off but has worse perplexity than selection using hidden states without a gate. Based on these insights, we propose a new selection method, Avg-K, which selects blocks through their mean aggregated hidden states. With 1% additional FLOPs, Avg-K achieves a perplexity 2.16 lower than that of a vanilla Transformer (16.96), outperforming Switch Transformer (16.45).
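The selection idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the FFN's first-layer rows act as "keys", that keys are grouped into contiguous blocks of equal size, that each block is scored by the dot product between the input and the block's mean key, and that the top-scoring blocks are then evaluated with a ReLU. The function names `avg_k_select` and `sparse_ffn`, the parameter names, and the use of NumPy are all hypothetical choices made for illustration.

```python
import numpy as np


def avg_k_select(x, keys, block_size, num_selected):
    """Return indices of the blocks whose mean key scores highest against x.

    x: (d,) input hidden state.
    keys: (m, d) first-layer FFN weights, viewed as m memory keys.
    block_size: number of keys per block (assumed to divide m).
    num_selected: number of blocks to activate.
    """
    m, d = keys.shape
    assert m % block_size == 0, "block_size must divide the number of keys"
    num_blocks = m // block_size
    # Assumption: a block is represented by the mean of its keys.
    block_keys = keys.reshape(num_blocks, block_size, d).mean(axis=1)
    scores = block_keys @ x  # one score per block
    top = np.argsort(-scores)[:num_selected]
    return np.sort(top)


def sparse_ffn(x, keys, values, block_size, num_selected):
    """Evaluate the FFN using only the memory cells in the selected blocks."""
    blocks = avg_k_select(x, keys, block_size, num_selected)
    # Expand block indices into the row indices of their memory cells.
    rows = (blocks[:, None] * block_size + np.arange(block_size)).ravel()
    hidden = np.maximum(keys[rows] @ x, 0.0)  # ReLU on the active cells only
    return hidden @ values[rows]
```

In this sketch, scoring costs one dot product per block rather than one per memory cell, which is one way the selection overhead could stay small relative to the dense FFN.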
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
TL;DR: We present a unified framework for large and sparse feed-forward networks in Transformers, and use it to arrive at a better method.
Supplementary Material: zip