HORST: Composing Optimizer Geometries for Sparse Transformer Training

Published: 29 May 2026, Last Modified: 29 May 2026
Venue: HiLD at ICML 2026 Poster
Readers: Everyone
License: CC BY 4.0
Keywords: Sparsity, Optimization, Steepest descent, Mirror descent, Transformers, LLMs, Composition, Modularity
TL;DR: HORST is a modular optimizer built on non-commutative compositions of optimization steps, combining the training stability of adaptive methods with an $L_1$ sparsity bias for robust sparse transformer training.
Abstract: Sparsifying transformers remains a fundamental challenge, as standard optimizers fail to simultaneously encourage sparsity and maintain training stability. Effective adaptive optimizers exhibit an implicit $L_{\infty}$ bias that favors stability, whereas sparsity requires an $L_1$ bias. To integrate sparsity, we propose composing optimizer steps, which we cast as non-commutative operators so that their optimization geometries can be analyzed and combined in a principled way. This yields HORST (Hyperbolic Operator for Robust Sparse Training), a modular optimizer that inherits stability from adaptive methods while inducing an $L_1$ sparsity bias through a hyperbolic mirror map. Our experiments demonstrate its utility for sparse training of transformers on both vision and language tasks. HORST significantly outperforms AdamW baselines across all sparsity levels, with large gains at higher sparsity.
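To make the idea of composing optimizer steps as non-commutative operators concrete, the sketch below chains two generic update rules in both orders and measures how far the results differ. It is an illustration under stated assumptions, not the HORST update rule, which this page does not specify. Sign descent stands in for an adaptive, $L_{\infty}$-type step, and the hyperbolic step uses the mirror map whose gradient is $\operatorname{arcsinh}(w/\beta)$, one common choice of hyperbolic mirror map in the literature. The function names, the toy quadratic loss, and the values of `lr` and `beta` are all hypothetical.

```python
import numpy as np

def sign_step(w, grad, lr):
    # Steepest descent under the L-infinity norm: every coordinate moves by lr.
    # Used here as a simple stand-in for an adaptive step (assumption).
    return w - lr * np.sign(grad)

def hyperbolic_mirror_step(w, grad, lr, beta):
    # Mirror descent with a hyperbolic mirror map whose gradient is
    # arcsinh(w / beta): step in the dual space, then map back with sinh.
    # For |w| much larger than beta the update is close to multiplicative.
    return beta * np.sinh(np.arcsinh(w / beta) - lr * grad)

def loss_grad(w):
    # Gradient of the toy loss 0.5 * ||w - target||^2 (illustrative only).
    target = np.array([1.0, 0.0, -2.0])
    return w - target

w0 = np.array([0.5, 0.3, -0.1])
lr, beta = 0.1, 0.01  # hypothetical hyperparameters

# Order A: sign step first, then hyperbolic mirror step.
a = sign_step(w0, loss_grad(w0), lr)
a = hyperbolic_mirror_step(a, loss_grad(a), lr, beta)

# Order B: hyperbolic mirror step first, then sign step.
b = hyperbolic_mirror_step(w0, loss_grad(w0), lr, beta)
b = sign_step(b, loss_grad(b), lr)

# A nonzero gap shows that the two steps do not commute.
order_gap = np.max(np.abs(a - b))
print("largest coordinate difference between the two orders:", order_gap)
```

Because the two orders reach different points from the same start, the order of composition is itself a design choice, which is why treating the steps as operators is a natural way to reason about the combined geometry.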
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 186