Keywords: Robustness, Transformer Pruning, Certified Robustness
TL;DR: RAHP is a pruning framework for Transformers that jointly optimizes certified robustness and accuracy by removing attention heads according to a composite score of $\Delta$CLEVER and Fisher information, yielding smaller and more robust models.
Abstract: Transformers lie at the core of modern AI, yet their susceptibility to adversarial perturbations raises reliability concerns. Empirical defenses often lack guarantees, while certification-based approaches provide them at nontrivial computational cost. We introduce RAHP (Robustness-Aware Head Pruning), a certification-guided pruning framework for Transformers. RAHP scores each attention head with a composite of (i) $\Delta$CLEVER, the predicted increase in a certified-robustness lower bound when masking that head, and (ii) Fisher information, the estimated accuracy cost of removing it. We prune heads that maximize robustness gain per accuracy cost. Across evaluated tasks, RAHP yields compact models with stronger CLEVER lower bounds and minimal change in clean accuracy, and it improves resistance to a wide variety of strong attacks. By leveraging a certified metric to steer structural pruning, RAHP makes certification-oriented robustness more practical and scalable.
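To make the scoring rule concrete, here is a minimal sketch of head selection by robustness gain per accuracy cost. It is not the authors' implementation. The ratio form of the composite score, the `eps` stabilizer, the pruning `budget`, the function names, and the toy numbers are all assumptions made for illustration; the abstract states only that the score combines $\Delta$CLEVER and Fisher information.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def rank_heads(
    delta_clever: Dict[Head, float],
    fisher: Dict[Head, float],
    eps: float = 1e-8,  # assumed stabilizer to avoid division by zero
) -> List[Tuple[Head, float]]:
    """Rank heads by (assumed) score = delta_clever / (fisher + eps)."""
    scores = []
    for head, gain in delta_clever.items():
        cost = fisher[head]  # estimated accuracy cost of removing the head
        scores.append((head, gain / (cost + eps)))
    # Highest robustness gain per unit of accuracy cost comes first.
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores


def select_heads_to_prune(
    delta_clever: Dict[Head, float],
    fisher: Dict[Head, float],
    budget: int,  # hypothetical cap on the number of heads removed
) -> List[Head]:
    """Pick up to `budget` top-ranked heads whose masking is predicted to help."""
    ranked = rank_heads(delta_clever, fisher)
    return [head for head, score in ranked[:budget] if score > 0.0]


# Toy values (made up): predicted change in the CLEVER bound and Fisher cost.
delta_clever = {(0, 0): 0.30, (0, 1): -0.10, (1, 0): 0.05, (1, 1): 0.20}
fisher = {(0, 0): 0.60, (0, 1): 0.05, (1, 0): 0.01, (1, 1): 0.80}

pruned = select_heads_to_prune(delta_clever, fisher, budget=2)
print(pruned)
```

In this toy setting, head (1, 0) ranks first because its small predicted gain comes at almost no accuracy cost, while head (0, 1) is never selected because masking it is predicted to lower the bound.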
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 5448