Information-Efficient Transformers via Adaptive Token Pruning

16 Sept 2025 (modified: 08 Oct 2025) · Submitted to Agents4Science · CC BY 4.0
Keywords: Transformers, token pruning, entropy-guided pruning, adaptive computation, efficient inference, long-context modeling, sparsity, FLOPs reduction, uncertainty calibration, SST-2
TL;DR: Entropy-guided token pruning cuts a two-layer FLOPs proxy by ~37.5% on synthetic data at ρ≈0.5 with accuracy preserved (slightly better), and cuts estimated FLOPs by ~40% on SST-2 at ρ=0.75 with a moderate accuracy drop (0.914→0.827).
Abstract: Transformers suffer from quadratic attention cost, limiting deployment for long contexts on CPUs and edge devices. We propose an entropy-guided token pruning mechanism that retains a fixed budget of tokens after an initial attention layer, using predictive entropy as a proxy for informativeness. In controlled NumPy simulations on synthetic sequences (L=64, V=500), pruning to ρ≈0.5 reduces a two-layer FLOPs proxy by 37.5% while maintaining accuracy (0.551) and AUC (0.556), slightly exceeding both a full encoder and an attention-mass baseline. On SST-2, a PyTorch implementation with ρ=0.75 reduces estimated FLOPs by ~40% with accuracy 0.827 (vs. 0.914 baseline), illustrating a practical efficiency–accuracy trade-off. We release code and artifacts for both synthetic and real-data tracks, and analyze calibration, oracle overlap, and gate overhead. Our findings suggest entropy-guided pruning is a viable efficiency primitive, with optimal budgets depending on task structure and calibration quality.
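
A minimal NumPy sketch of the gating step described in the abstract: score each token by the entropy of its predictive distribution after the initial attention layer, keep a fixed budget of ⌈ρL⌉ tokens, and pass only those to the remaining layers. Function names, tensor shapes, and the scoring direction (keeping the highest-entropy tokens) are illustrative assumptions, not the released code.

```python
import numpy as np

def entropy_guided_prune(hidden, logits, rho=0.5):
    """Keep a fixed budget of ceil(rho * L) tokens ranked by predictive entropy.

    hidden: (L, d) token representations after an initial attention layer.
    logits: (L, V) per-token predictive logits used as an informativeness proxy.
    rho:    keep ratio (e.g., 0.5 on the synthetic track, 0.75 on SST-2).
    """
    L = hidden.shape[0]
    # Per-token predictive entropy from a softmax over the vocabulary.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)  # shape (L,)
    budget = int(np.ceil(rho * L))
    # Assumption: higher entropy marks more informative tokens; keep the top
    # `budget` tokens and restore their original order for later layers.
    keep = np.sort(np.argsort(-entropy)[:budget])
    return hidden[keep], keep

# Example matching the synthetic setup: L=64, V=500, pruning to rho=0.5.
rng = np.random.default_rng(0)
h, idx = entropy_guided_prune(rng.normal(size=(64, 32)),
                              rng.normal(size=(64, 500)), rho=0.5)
print(h.shape, idx.shape)  # (32, 32) (32,)
```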
Supplementary Material: zip
Submission Number: 244