TASP: Preserving Training Dynamics in Transformers via NTK-Aware Structured Pruning

Published: 05 Mar 2025, Last Modified: 24 Apr 2025 · SLLM · CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: Pruning, Model Compression, Language Models
Abstract: Structured pruning of large-scale Transformer models promises substantial efficiency gains by removing entire hidden units. However, such pruning often degrades accuracy more than unstructured pruning, necessitating compensation strategies such as supervised fine-tuning (SFT) or adapter modules (e.g., LoRA). In this paper, we introduce TASP (Neural Tangent Kernel-Aware Structured Pruning), a novel method that identifies and prunes low-saliency hidden units in Transformer models. Our approach computes a saliency score for each weight, defined as the product of the weight and the partial derivative of the network output with respect to that weight, and aggregates these scores to measure the contribution of each hidden unit. We prove, via a piecewise-linear bounding argument, that pruning the units with minimal saliency preserves the network's Neural Tangent Kernel (NTK) and, consequently, its training dynamics under Adam-based optimization. Empirical results on standard benchmarks confirm that TASP achieves significant model compression while maintaining training performance, offering a theoretically grounded and efficient pathway for compressing Transformer models.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 20
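
As a rough illustration of the saliency-based unit pruning described in the abstract, the following is a minimal PyTorch sketch for a single Transformer feed-forward block. It assumes gradients of a proxy loss have already been accumulated in `.grad`; the helper names (`unit_saliency`, `prune_ffn_units`) and the `keep_ratio` parameter are illustrative assumptions, not the authors' implementation, and the loss gradient stands in for the derivative of the network output used in the paper.

```python
import torch
import torch.nn as nn

def unit_saliency(weight: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Aggregate per-weight saliency |w * dL/dw| over the input dimension,
    giving one score per hidden unit (one per output row of the layer)."""
    return (weight * grad).abs().sum(dim=1)

def prune_ffn_units(ffn_in: nn.Linear, ffn_out: nn.Linear,
                    keep_ratio: float = 0.5) -> None:
    """Remove the lowest-saliency hidden units of a feed-forward block
    (ffn_in: d_model -> d_ff, ffn_out: d_ff -> d_model). Assumes
    ffn_in.weight.grad has been populated by a backward pass and that
    ffn_in has a bias term."""
    scores = unit_saliency(ffn_in.weight, ffn_in.weight.grad)
    k = max(1, int(keep_ratio * scores.numel()))
    keep = torch.topk(scores, k).indices.sort().values
    # Structured pruning: drop entire rows of W_in (and their biases)
    # together with the matching columns of W_out.
    ffn_in.weight = nn.Parameter(ffn_in.weight.data[keep])
    ffn_in.bias = nn.Parameter(ffn_in.bias.data[keep])
    ffn_out.weight = nn.Parameter(ffn_out.weight.data[:, keep])
    ffn_in.out_features = ffn_out.in_features = k

# Usage sketch: after one backward pass over a small calibration batch,
# prune_ffn_units(block.linear1, block.linear2, keep_ratio=0.5)
```

In practice the scores would be computed from a small calibration set and the step repeated for every Transformer block; the NTK-preservation guarantee in the abstract concerns which units are removed, not this bookkeeping.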