What Apples Tell About Oranges: Connecting Pruning Masks and Hessian Eigenspaces

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: deep learning, pruning, Hessian, Grassmannians
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We develop techniques to compare subspaces of deep network pruning masks and loss Hessians, and find that they overlap significantly, bridging the gap between first- and second-order methods.
Abstract: Recent studies have demonstrated that good pruning masks of neural networks emerge early during training, and that they remain largely stable thereafter. In a separate line of work, it has also been shown that the eigenspace of the loss Hessian shrinks drastically during early training, and likewise remains largely stable thereafter. While previous research establishes a direct relationship between individual network parameters and loss curvature at training convergence, in this study we investigate the connection between parameter pruning masks and Hessian eigenspaces throughout the entire training process, with particular attention to their early stabilization. To quantify the similarity between these seemingly disparate objects, we cast them as orthonormal matrices from the same Stiefel manifold, each defining a linear subspace. This allows us to measure the similarity of their spans using Grassmannian metrics. In our experiments, we train a deep neural network and demonstrate that these two subspaces overlap significantly, well above random chance, throughout the entire training process and not just at convergence. This overlap is largest at initialization, and then drops and stabilizes, providing a novel perspective on the early stabilization phenomenon and suggesting that, in deep learning, the largest parameter magnitudes tend to coincide with the directions of largest loss curvature. This early-stabilization and high-overlap phenomenon can be leveraged to approximate the typically intractable top Hessian subspace via parameter inspection, at only linear cost. The connection between parameters and loss curvature also offers a fresh perspective on existing work, building a bridge between first- and second-order methods.
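
For intuition, the kind of comparison the abstract describes can be sketched in a few lines of NumPy: treat the top-k magnitude pruning mask as a coordinate subspace, take the span of the top-k Hessian eigenvectors as a second subspace of the same dimension, and compare the two spans with a Grassmannian-style overlap (here, the mean squared cosine of the principal angles). This is a minimal illustrative sketch under those assumptions; the function name, the specific overlap score, and the random baseline are hypothetical choices for illustration, not the paper's implementation.

```python
import numpy as np

def mask_hessian_overlap(params, hessian, k):
    """Overlap between (i) the coordinate subspace selected by a top-k
    magnitude pruning mask and (ii) the span of the top-k Hessian
    eigenvectors. Returns a score in [0, 1]: 1 means identical spans,
    and roughly k/d is what a random k-dimensional subspace would give.
    NOTE: illustrative metric choice, not necessarily the paper's."""
    # Top-k magnitude mask -> indices of the selected coordinate axes.
    mask_idx = np.argsort(np.abs(params))[-k:]
    # Top-k Hessian eigenvectors (largest eigenvalues); eigh sorts ascending.
    _, eigvecs = np.linalg.eigh(hessian)
    U_k = eigvecs[:, -k:]
    # Rows of U_k at the masked coordinates give M^T U_k, where M stacks the
    # selected standard basis vectors; its squared Frobenius norm is the sum
    # of squared cosines of the principal angles between the two subspaces.
    return np.sum(U_k[mask_idx, :] ** 2) / k

# Toy usage with a random symmetric "Hessian" (hypothetical data).
rng = np.random.default_rng(0)
d, k = 100, 10
params = rng.normal(size=d)
H = rng.normal(size=(d, d))
H = (H + H.T) / 2
print(mask_hessian_overlap(params, H, k))  # compare against the random baseline k/d = 0.1
```

Scores well above the random baseline k/d would indicate that the mask's coordinate axes align with the directions of largest curvature, which is the sense in which the abstract's "well above random chance" claim can be read.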
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5785