MoEfication: Conditional Computation of Transformer Models for Efficient Inference

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission · Readers: Everyone
Abstract: Transformer-based pre-trained language models achieve superior performance on most NLP tasks due to their large parameter capacity, but this capacity also leads to huge computation costs. Fortunately, we observe that most inputs activate only a tiny fraction of the neurons of large Transformer-based pre-trained models during inference. Hence, we propose to convert a model into its mixture-of-experts (MoE) version with the same parameters, namely MoEfication, which accelerates large-model inference by conditional computation based on this sparse-activation phenomenon. Specifically, MoEfication consists of two phases: (1) splitting the parameters of the feed-forward networks (FFNs) into multiple parts as experts, and (2) building expert routers to decide which experts will be used for each input. Experimental results show that MoEfication can save $80\%$ of the computation cost of FFNs while maintaining over $95\%$ of the original performance for different models, including models of different sizes (up to 3 billion parameters) and distilled models, on various downstream tasks. Moreover, we find that the MoEfied model achieves better performance than an MoE model pre-trained from scratch with the same model size. We will release all the code and models of this paper.
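To make the two phases in the abstract concrete, below is a minimal PyTorch sketch, not the authors' released code. It assumes a standard ReLU FFN (two linear layers) and, for illustration only, partitions the intermediate neurons into contiguous blocks; the paper's actual splitting and router construction may group neurons differently. Names such as `split_ffn_into_experts`, `MoEfiedFFN`, `num_experts`, and `top_k` are hypothetical.

```python
# Sketch of MoEfication's two phases (assumptions noted above).
import torch
import torch.nn as nn


def split_ffn_into_experts(ffn_in: nn.Linear, ffn_out: nn.Linear, num_experts: int) -> nn.ModuleList:
    """Phase 1: split an FFN's intermediate neurons into expert blocks.

    ffn_in:  d_model -> d_ff projection (rows are partitioned)
    ffn_out: d_ff -> d_model projection (matching columns are partitioned)
    The parameters are only rearranged, not changed.
    """
    d_ff = ffn_in.out_features
    assert d_ff % num_experts == 0, "d_ff must be divisible by num_experts"
    size = d_ff // num_experts
    experts = nn.ModuleList()
    for e in range(num_experts):
        sl = slice(e * size, (e + 1) * size)
        w_in = nn.Linear(ffn_in.in_features, size)
        w_out = nn.Linear(size, ffn_out.out_features, bias=False)  # output bias handled once, below
        with torch.no_grad():
            w_in.weight.copy_(ffn_in.weight[sl])
            w_in.bias.copy_(ffn_in.bias[sl])
            w_out.weight.copy_(ffn_out.weight[:, sl])
        experts.append(nn.Sequential(w_in, nn.ReLU(), w_out))
    return experts


class MoEfiedFFN(nn.Module):
    """Phase 2: an expert router scores experts per token and only the top-k run."""

    def __init__(self, experts: nn.ModuleList, d_model: int, output_bias: torch.Tensor, top_k: int):
        super().__init__()
        self.experts = experts
        self.router = nn.Linear(d_model, len(experts))  # learned scorer (hypothetical design)
        self.output_bias = nn.Parameter(output_bias.clone())
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        scores = self.router(x)                           # (tokens, num_experts)
        top_idx = scores.topk(self.top_k, dim=-1).indices
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (top_idx == e).any(dim=-1)              # tokens routed to expert e
            if mask.any():
                out[mask] += expert(x[mask])               # skip computation for other tokens
        return out + self.output_bias
```

If `top_k` equals `num_experts`, this sketch reproduces the original FFN exactly; selecting fewer experts per token is what yields the conditional-computation savings described in the abstract.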