HEADS: Head-Wise Efficient and Adaptive Sparsification for Transformer-Based Models

Published: 2025 · Last Modified: 05 Nov 2025 · AICAS 2025 · CC BY-SA 4.0
Abstract: Dynamic token pruning plays an increasingly important role in reducing the computational complexity and memory demands of the multi-head attention mechanism in Transformer-based models. However, the coarse-grained sparsification used in existing methods overlooks contextual nuances, limiting hardware efficiency and the joint optimization of model accuracy, complexity, and energy consumption. To address these limitations, we propose a head-wise adaptive sparsification scheme that selectively removes less relevant tokens for each attention head. This finer-grained, context-aware pruning, with an adaptive pruning rate, improves inference latency while effectively maintaining model accuracy. Moreover, the resulting sparse feed-forward networks can be mapped to head-wise dense matrix multiplications, further enhancing hardware utilization and reducing energy consumption. Experimental results on the BERT-Base model demonstrate that our approach reduces the energy-delay product by an average of 57.43% relative to the unpruned baseline, outperforming the state of the art by 22.34%.
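The core idea of head-wise adaptive pruning can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's exact algorithm: each head scores tokens by the attention mass they receive, then keeps only enough tokens to cover a target fraction of that mass, so the pruning rate adapts per head and per input. The function name `headwise_prune` and the `keep_mass` parameter are illustrative assumptions.

```python
# Hypothetical sketch of head-wise adaptive token pruning (illustrative,
# not the paper's exact method): score tokens per head by received
# attention, keep the smallest top set covering `keep_mass` of the total.
import numpy as np

def headwise_prune(attn, keep_mass=0.9):
    """attn: (heads, seq, seq) row-stochastic attention matrices.
    Returns one array of kept token indices per head."""
    kept = []
    for head in attn:                       # (seq, seq) for this head
        score = head.sum(axis=0)            # attention each token receives
        order = np.argsort(score)[::-1]     # most-attended tokens first
        cum = np.cumsum(score[order])       # cumulative attention mass
        # Adaptive rate: number kept depends on how concentrated the
        # attention is in this head, not on a fixed global budget.
        k = int(np.searchsorted(cum, keep_mass * cum[-1])) + 1
        kept.append(np.sort(order[:k]))     # each head keeps its own subset
    return kept

# Toy example: 12 heads, sequence length 16, softmax-normalized rows.
rng = np.random.default_rng(0)
logits = rng.normal(size=(12, 16, 16))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
subsets = headwise_prune(attn, keep_mass=0.8)
print([len(s) for s in subsets])  # per-head kept-token counts differ
```

Because each head retains a different, typically smaller token subset, the subsequent computation for that head becomes a smaller dense matrix multiplication, which is the property the paper exploits for hardware utilization.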