TASP: Preserving Training Dynamics in Transformers via NTK-Aware Structured Pruning

Published: 05 Mar 2025, Last Modified: 24 Apr 2025 · SLLM · CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: Pruning, Model Compression, Language Models
Abstract: Structured pruning of large-scale Transformer models promises substantial efficiency gains by removing entire hidden units. However, such pruning often degrades accuracy more than unstructured pruning, necessitating compensation strategies such as supervised fine-tuning (SFT) or adapter modules (e.g., LoRA). In this paper, we introduce TASP (Neural Tangent Kernel-Aware Structured Pruning), a novel method that identifies and prunes low-saliency hidden units in Transformer models. Our approach computes a saliency score for each weight, defined as the product of the weight and the partial derivative of the network output with respect to that weight, and aggregates these scores to measure the contribution of each hidden unit. We prove, via a piecewise-linear bounding argument, that pruning the units with minimal saliency preserves the network's Neural Tangent Kernel (NTK) and, consequently, its training dynamics under Adam-based optimization. Empirical results on standard benchmarks confirm that TASP achieves significant model compression while maintaining training performance, offering a theoretically grounded and efficient pathway for compressing Transformer models.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 20
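
As a rough illustration of the saliency-based unit pruning described in the abstract, the following is a minimal PyTorch sketch for a single Transformer feed-forward block. It assumes gradients of a proxy loss have already been accumulated in `.grad`; the helper names (`unit_saliency`, `prune_ffn_units`) and the `keep_ratio` parameter are illustrative assumptions, not the authors' implementation, and the loss gradient stands in for the derivative of the network output used in the paper.

```python
import torch
import torch.nn as nn

def unit_saliency(weight: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Aggregate per-weight saliency |w * dL/dw| over the input dimension,
    giving one score per hidden unit (one per output row of the layer)."""
    return (weight * grad).abs().sum(dim=1)

def prune_ffn_units(ffn_in: nn.Linear, ffn_out: nn.Linear,
                    keep_ratio: float = 0.5) -> None:
    """Remove the lowest-saliency hidden units of a feed-forward block
    (ffn_in: d_model -> d_ff, ffn_out: d_ff -> d_model). Assumes
    ffn_in.weight.grad has been populated by a backward pass and that
    ffn_in has a bias term."""
    scores = unit_saliency(ffn_in.weight, ffn_in.weight.grad)
    k = max(1, int(keep_ratio * scores.numel()))
    keep = torch.topk(scores, k).indices.sort().values
    # Structured pruning: drop entire rows of W_in (and their biases)
    # together with the matching columns of W_out.
    ffn_in.weight = nn.Parameter(ffn_in.weight.data[keep])
    ffn_in.bias = nn.Parameter(ffn_in.bias.data[keep])
    ffn_out.weight = nn.Parameter(ffn_out.weight.data[:, keep])
    ffn_in.out_features = ffn_out.in_features = k

# Usage sketch: after one backward pass over a small calibration batch,
# prune_ffn_units(block.linear1, block.linear2, keep_ratio=0.5)
```

In practice the scores would be computed from a small calibration set and the step repeated for every Transformer block; the NTK-preservation guarantee in the abstract concerns which units are removed, not this bookkeeping.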