PartialFormer: Modeling Part Instead of Whole for Machine Translation

16 Jun 2023 (modified: 01 Dec 2023) · Submitted to EMNLP 2023
Submission Type: Regular Long Paper
Submission Track: Machine Translation
Keywords: Lightweight Transformer
Abstract: The parameter redundancy problem in Transformer models has been widely acknowledged in the literature. To address this weakness, we introduce PartialFormer, a parameter-efficient Transformer architecture for machine translation. Compared with previous parameter-efficient Transformer architectures, PartialFormer modifies the modeling strategy of the feed-forward network, allowing it to save a substantial number of parameters while maintaining a large hidden dimension. Additionally, PartialFormer applies two efficient scaling strategies, namely depth scaling and width scaling, to improve performance within a given parameter budget. To benefit efficiently from these scaling strategies, PartialFormer is further enhanced with two cost-effective modifications: 1) a head scaling strategy for efficient width scaling and 2) a residual-like attention calculation for better depth scaling. Extensive experiments on 9 translation tasks validate the effectiveness of PartialFormer.
Submission Number: 2584
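The abstract describes the architecture only at a high level, so the following is a minimal, hedged sketch of one way to read it: a per-head feed-forward sub-network (modeling the "part" rather than the whole model dimension) and a residual-like reuse of attention scores across layers. The class name, dimensions, and the exact form of the score accumulation are assumptions for illustration, not the paper's definitive formulation.

```python
import torch
import torch.nn as nn


class PartialFFNAttentionHead(nn.Module):
    """Hypothetical sketch of one attention head with its own small FFN.

    Modeling the FFN per head keeps its hidden width tied to d_head rather
    than d_model, which is one plausible way to save parameters while
    keeping a relatively large hidden dimension per part.
    """

    def __init__(self, d_model: int, d_head: int, d_ffn_head: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_head)
        self.k = nn.Linear(d_model, d_head)
        self.v = nn.Linear(d_model, d_head)
        # Per-head FFN: d_head x d_ffn_head instead of d_model x d_ffn.
        self.ffn = nn.Sequential(
            nn.Linear(d_head, d_ffn_head),
            nn.ReLU(),
            nn.Linear(d_ffn_head, d_head),
        )

    def forward(self, x: torch.Tensor, prev_scores: torch.Tensor | None = None):
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        # Residual-like attention: add the previous layer's raw scores
        # before the softmax (an assumed interpretation of the abstract's
        # "residual-like attention calculation").
        if prev_scores is not None:
            scores = scores + prev_scores
        attn = scores.softmax(dim=-1)
        head_out = attn @ v
        # Apply the head-local FFN with a residual connection.
        return head_out + self.ffn(head_out), scores


# Usage example: stack two such heads depth-wise, passing scores forward.
if __name__ == "__main__":
    x = torch.randn(2, 10, 512)  # (batch, sequence, d_model)
    layer1 = PartialFFNAttentionHead(d_model=512, d_head=64, d_ffn_head=256)
    layer2 = PartialFFNAttentionHead(d_model=512, d_head=64, d_ffn_head=256)
    out1, scores1 = layer1(x)
    out2, _ = layer2(x, prev_scores=scores1)
    print(out2.shape)  # torch.Size([2, 10, 64])
```

Under these assumptions, width scaling would correspond to adding more such heads (the abstract's "head scaling"), while depth scaling would benefit from the cross-layer score reuse; the actual PartialFormer design should be taken from the paper itself.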