Merging Feed-Forward Sublayer for Compressed Transformers

TMLR Paper7540 Authors

16 Feb 2026 (modified: 22 Jun 2026), Decision pending for TMLR, CC BY 4.0
Abstract: Pruning is a prevailing model compression method that typically operates by identifying and removing unimportant parameters based on various importance metrics. In this work, we challenge this paradigm by instead targeting redundant parameters via intra-model merging techniques. Specifically, we propose a method that combines multiple feed-forward sublayers in Transformer models through neuron alignment, merging, and weight tying. We find that this method produces compressed models whose performance is comparable to that of the original models while tying more than a third of their feed-forward sublayers, and that it outperforms a strong, generalized layer pruning baseline. For example, we can remove more than 21% of the total parameters from a vision transformer while maintaining 99% of its original performance on ImageNet. Additionally, we observe high activation similarity between different feed-forward sublayers, offering novel insight into their behavior and contextualizing their surprising mergeability.
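The three steps named in the abstract (neuron alignment, merging, weight tying) can be illustrated with a minimal NumPy/SciPy sketch. This is not the authors' implementation: it assumes a two-layer ReLU feed-forward sublayer, alignment by a permutation that maximizes cross-correlation of hidden activations, and merging by plain averaging. All function names (`ffn`, `hidden_acts`, `align_to_anchor`, `permute`, `merge`) are hypothetical and chosen for illustration only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def ffn(x, w_in, b_in, w_out, b_out):
    """Feed-forward sublayer (assumed form): relu(x W_in^T + b_in) W_out^T + b_out."""
    hidden = np.maximum(x @ w_in.T + b_in, 0.0)
    return hidden @ w_out.T + b_out


def hidden_acts(x, w_in, b_in):
    """Hidden activations of the sublayer, shape (num_inputs, d_ff)."""
    return np.maximum(x @ w_in.T + b_in, 0.0)


def align_to_anchor(acts_anchor, acts_other):
    """Return perm so that neuron perm[i] of the other layer matches anchor neuron i.

    Assumption: similarity is the Pearson correlation of hidden activations,
    and the matching is solved as a linear assignment problem.
    """
    a = acts_anchor - acts_anchor.mean(axis=0)
    b = acts_other - acts_other.mean(axis=0)
    a = a / (np.linalg.norm(a, axis=0) + 1e-8)
    b = b / (np.linalg.norm(b, axis=0) + 1e-8)
    corr = a.T @ b  # (d_ff, d_ff): anchor neurons in rows, other neurons in columns
    _, perm = linear_sum_assignment(corr, maximize=True)
    return perm


def permute(params, perm):
    """Reorder hidden neurons; this leaves the sublayer's function unchanged."""
    w_in, b_in, w_out, b_out = params
    return w_in[perm], b_in[perm], w_out[:, perm], b_out


def merge(param_sets):
    """Average already-aligned parameter sets elementwise (assumed merge rule)."""
    return tuple(np.mean(np.stack(p), axis=0) for p in zip(*param_sets))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_ff = 8, 16
    layer_a = (rng.normal(size=(d_ff, d_model)), rng.normal(size=d_ff),
               rng.normal(size=(d_model, d_ff)), rng.normal(size=d_model))
    layer_b = (rng.normal(size=(d_ff, d_model)), rng.normal(size=d_ff),
               rng.normal(size=(d_model, d_ff)), rng.normal(size=d_model))
    x = rng.normal(size=(256, d_model))  # stand-in for real layer inputs

    perm = align_to_anchor(hidden_acts(x, layer_a[0], layer_a[1]),
                           hidden_acts(x, layer_b[0], layer_b[1]))
    shared = merge([layer_a, permute(layer_b, perm)])

    # Weight tying: both layer positions reference the same parameter tuple,
    # so only one set of feed-forward parameters is stored.
    tied_layers = [shared, shared]
    print(ffn(x, *tied_layers[0]).shape)
```

Permuting the hidden neurons before averaging matters because two sublayers can compute similar features in different neuron orders; averaging without alignment would mix unrelated neurons. In practice the activations would come from real data passed through the model rather than random inputs, and the tied model may need further tuning to recover quality.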
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: All changes are marked in blue; they reflect many of the writing changes requested by the reviewers.
Assigned Action Editor: ~Zhihui_Zhu1
Submission Number: 7540