Keywords: Neural Network Sparsity, Feature Representations, Oracle Analysis, Sparse Topology, Pruning, Mechanistic Interpretability
TL;DR: We provide a mechanistic explanation of neural network sparsity, showing that accuracy collapses when pruning destroys feature-space representations and that preserving these representations maintains performance at high sparsities.
Abstract: Why does something as simple as magnitude pruning succeed even when most weights are removed, and why does it eventually fail? We study this in a controlled setting where the network's feature-space weights are explicit. We find that accuracy collapse under standard pruning coincides with divergence of this feature-space representation from its dense counterpart. We then introduce an optimization oracle that selects sparse weight matrices, independently of the original weights, that explicitly preserve the latent feature structure. Under identical retraining budgets, the oracle recovers performance on both logic and vision tasks at sparsities where standard pruning degrades. This suggests that sparsification limits can arise from misalignment between weight-space and feature-level structure, and points to feature-aware criteria as a path toward improved pruning methods.
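The abstract describes the method only at a high level. As a rough illustration, not the paper's code, the following minimal numpy sketch shows standard magnitude pruning and a feature-space divergence measure of the kind the abstract refers to; the function names, the Frobenius-style distance, and the single-linear-layer setup are all assumptions for the sake of the example.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the smallest-magnitude entries of W to reach the target sparsity."""
    k = int(sparsity * W.size)
    if k == 0:
        return W.copy()
    # k-th smallest absolute value serves as the pruning threshold
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return W * (np.abs(W) > threshold)

def feature_divergence(W_dense, W_sparse, X):
    """Mean distance between dense and pruned feature representations on inputs X."""
    H_dense = X @ W_dense
    H_sparse = X @ W_sparse
    return np.linalg.norm(H_dense - H_sparse, axis=1).mean()

# Illustrative run: divergence from the dense features grows with sparsity
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
X = rng.normal(size=(128, 64))
for s in (0.5, 0.9, 0.99):
    W_s = magnitude_prune(W, s)
    print(f"sparsity={s:.2f}  divergence={feature_divergence(W, W_s, X):.3f}")
```

A feature-aware criterion in the abstract's sense would select the sparse matrix to minimize such a divergence directly, rather than keeping the largest-magnitude weights.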
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Style Files: I have used the style files.
Submission Number: 45