An Adaptive Scheme of Threshold Adjustment for Dynamic Sparsity Extraction of Self-Attention Network
Abstract: Large Language Models (LLMs) and transformers have become highly successful across various domains. However, they are notorious for their quadratic computational complexity, which grows with sequence length. To mitigate this, dynamic sparsity techniques skip near-zero output patterns identified by low-precision estimations: values below a static threshold are pruned, reducing energy consumption and improving computation speed. By using low-cost estimates of the effect of small threshold adjustments, we continuously monitor and fine-tune the pruning strategy to avoid overly aggressive pruning. Experimental results demonstrate that the proposed adaptive threshold method provides an average accuracy improvement of 0.15%, along with an average of 8.95% additional computational sparsity, across the SQuAD v1.1, SQuAD v2, SST-2, and MRPC datasets.
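To illustrate the general idea of threshold-based attention pruning with an adaptive threshold, the following PyTorch sketch is provided; it is not the authors' implementation, and the function name, the target-sparsity setting, and the simple feedback rule for adjusting the threshold are all assumptions made for the example.

```python
import torch

def adaptive_threshold_attention(q, k, v, threshold, target_sparsity=0.6, step=0.01):
    """Hypothetical sketch: prune attention scores below a threshold, then nudge
    the threshold toward a target sparsity level (parameters are assumptions)."""
    scores = (q @ k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    # Low-precision estimate of the scores guides pruning (here: an fp16 cast).
    est = scores.half().float()
    keep = est >= threshold
    # Always keep each row's maximum so softmax never sees an all-pruned row.
    keep = keep | (est == est.max(dim=-1, keepdim=True).values)
    observed_sparsity = 1.0 - keep.float().mean().item()
    # Feedback rule: relax the threshold if pruning is too aggressive,
    # tighten it slightly if pruning is too conservative.
    if observed_sparsity > target_sparsity:
        threshold -= step
    else:
        threshold += step
    scores = scores.masked_fill(~keep, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    return attn @ v, threshold
```

In this sketch the returned threshold would be carried over to the next forward pass, so the pruning strategy is monitored and adjusted continuously rather than being fixed in advance.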
External IDs: dblp:conf/aicas/XiaoCSLHCL25