Effects of Attention Head Pruning on Encoder-only Language Models for Multilingual Recipe Classification
Abstract: Pruning, as a method of reducing model size and improving performance, has gained increasing traction in recent years. Models such as BERT have been shown to be over-parametrized: multiple attention heads encode the same patterns. Additionally, models trained on general data may underperform on domain-specific tasks, such as recipe interpretation. In this work, we explore the effects of score-based attention head pruning in multilingual transformer models. We conduct our experiments on a dataset comprising six Indo-European languages with unequal representation across languages, on three tasks of varying difficulty. We assign each attention head a score based on its contribution to overall model performance, then evaluate the impact of successive pruning based on that score. Our findings suggest that substantial pruning (up to $80\%$) can be performed without major performance loss when applied post-finetuning. Easier tasks exhibit slower performance degradation as the percentage of pruned heads increases. We also report consistent reductions in inference time. Contrary to our expectations, low-resource languages did not suffer significantly faster performance degradation under pruning.
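The score-then-prune procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the head score is computed, so the sketch assumes a simple ablation score (the drop in a metric when a single head is masked). The names `score_heads`, `pruning_schedule`, and `toy_evaluate`, as well as the toy importance values, are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def score_heads(heads: List[Head],
                evaluate: Callable[[Set[Head]], float]) -> Dict[Head, float]:
    """Assumed scoring rule: metric drop when one head alone is masked."""
    baseline = evaluate(set())
    return {h: baseline - evaluate({h}) for h in heads}


def pruning_schedule(scores: Dict[Head, float],
                     fractions: Iterable[float]) -> Dict[float, Set[Head]]:
    """For each fraction, pick that share of heads, lowest score first."""
    ranked = sorted(scores, key=lambda h: scores[h])  # least important first
    return {f: set(ranked[: int(round(f * len(ranked)))]) for f in fractions}


def toy_evaluate(pruned: Set[Head]) -> float:
    """Stand-in for a real validation run; the values are made up."""
    importance = {(0, 0): 0.30, (0, 1): 0.02, (1, 0): 0.10, (1, 1): 0.01}
    return 0.9 - sum(importance[h] for h in pruned)


heads = [(0, 0), (0, 1), (1, 0), (1, 1)]
scores = score_heads(heads, toy_evaluate)
schedule = pruning_schedule(scores, [0.25, 0.5, 0.75])

# Successive pruning: re-evaluate the metric at each pruning level.
for fraction, pruned in schedule.items():
    print(fraction, sorted(pruned), round(toy_evaluate(pruned), 3))
```

In a real setting, `toy_evaluate` would be replaced by an evaluation of the fine-tuned model on held-out data with the given heads masked or removed. Scoring each head in isolation ignores interactions between heads, which is a simplification of this sketch.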
External IDs: dblp:conf/sped/AnghelusNLP25