Differentiable Attention Sparsity via Structured $D$-Gating

Published: 05 Mar 2025 · Last Modified: 02 Apr 2025 · SLLM · CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: Sparsity, Attention, Non-Smoothness, Optimization
Abstract: A core component of modern large language models is the attention mechanism, but its immense parameter count necessitates structured sparsity for resource-efficient optimization and inference. Traditional sparsity penalties, such as the group lasso, are non-smooth and thus incompatible with standard stochastic gradient descent methods. To address this, we propose a deep gating mechanism that reformulates the structured sparsity penalty as a fully differentiable optimization problem, allowing effective and principled norm-based group sparsification without requiring specialized non-smooth optimizers. Our theoretical analysis and empirical results demonstrate that this approach enables structured sparsity with plain stochastic gradient descent or its variants while maintaining predictive performance.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 48
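
To illustrate the idea described in the abstract, below is a minimal, hypothetical sketch of how a gating reparameterization can turn a non-smooth group-sparsity penalty into a smooth one. The class name `DGatedHeads`, the scalar-gate-times-weight parameterization, the quadratic surrogate penalty, and all hyperparameters are assumptions for illustration only; the paper's exact structured $D$-gating construction and penalty are not reproduced here.

```python
# Hypothetical sketch (not the paper's implementation): each attention head's
# projection W_h is reparameterized as W_h = (prod_k g_{h,k}) * V_h, and a smooth
# quadratic penalty on the gates and base weights stands in for the non-smooth
# group lasso, so plain SGD/Adam can be used throughout.
import torch
import torch.nn as nn


class DGatedHeads(nn.Module):
    def __init__(self, num_heads: int, head_dim: int, model_dim: int, depth: int = 2):
        super().__init__()
        # Base (ungated) per-head projection weights.
        self.V = nn.Parameter(torch.randn(num_heads, head_dim, model_dim) * 0.02)
        # (depth - 1) scalar gates per head; their product scales the whole head.
        self.gates = nn.Parameter(torch.ones(depth - 1, num_heads))

    def effective_weights(self) -> torch.Tensor:
        # W_h = (g_{h,1} * ... * g_{h,D-1}) * V_h  -- fully differentiable.
        scale = self.gates.prod(dim=0)            # shape: (num_heads,)
        return scale.view(-1, 1, 1) * self.V      # shape: (num_heads, head_dim, model_dim)

    def smooth_penalty(self) -> torch.Tensor:
        # Quadratic surrogate: sum of squared gate and weight norms.
        # Shrinking a head's gates toward zero switches off the entire head,
        # giving structured (head-level) sparsity without a non-smooth term.
        return (self.gates ** 2).sum() + (self.V ** 2).sum()


# Usage sketch: add the smooth penalty to the task loss and train with SGD.
layer = DGatedHeads(num_heads=8, head_dim=64, model_dim=512, depth=3)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
x = torch.randn(4, 512)
out = torch.einsum("bm,hdm->bhd", x, layer.effective_weights())
loss = out.pow(2).mean() + 1e-3 * layer.smooth_penalty()
loss.backward()
opt.step()
```

The design intent this sketch tries to convey is that the sparsity-inducing behavior is moved from a non-smooth penalty on the weights themselves into the geometry of an overparameterized, fully differentiable factorization, so the training loss remains smooth and compatible with standard stochastic gradient methods.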
