Fishers for Free? Approximating the Fisher Information Matrix by Recycling the Squared Gradient Accumulator
Abstract: The diagonal of a model's Fisher Information Matrix (the "Fisher") has frequently been used as a way to measure parameter sensitivity.
Typically, the Fisher is estimated by computing the squared gradient of the model's outputs with respect to its parameters, averaged over a few hundred or thousand examples — a process which incurs nontrivial computational costs.
At the same time, adaptive gradient methods like the ubiquitous Adam optimizer compute a moving average of the squared gradient over the course of training.
This paper therefore explores whether an approximation of the Fisher can be obtained "for free" by recycling the squared gradient accumulator that has already been computed over the course of training.
Through a comprehensive set of experiments covering five applications of the Fisher, we demonstrate that the "Squisher" (**Squ**ared gradient accumulator as an approximation of the F**isher**) consistently performs similarly to the Fisher while outperforming baseline methods.
Additionally, we clarify the exact differences between the Squisher and the Fisher and empirically quantify their respective impacts.
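To make the comparison concrete, here is a rough PyTorch-style sketch (illustrative names and helpers, not code from the paper) of the two quantities involved: an explicit estimate of the diagonal Fisher, and the squared-gradient accumulator that can be read out of an Adam optimizer's state for free.

```python
# Minimal sketch (not the authors' implementation) contrasting an explicit
# diagonal-Fisher estimate with the approximation recycled from Adam's state.
# Assumes a PyTorch classification model and a DataLoader; names are illustrative.
import torch
import torch.nn.functional as F


def diagonal_fisher(model, data_loader, num_examples=1000, device="cpu"):
    """Estimate diag(Fisher) as the average squared gradient of the
    log-likelihood of labels sampled from the model's own predictions."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    seen = 0
    model.eval()
    for inputs, _ in data_loader:
        inputs = inputs.to(device)
        for x in inputs:
            logits = model(x.unsqueeze(0))
            # Sample a label from the model's own predictive distribution.
            y = torch.distributions.Categorical(logits=logits).sample()
            loss = F.cross_entropy(logits, y)
            model.zero_grad()
            loss.backward()
            for n, p in model.named_parameters():
                if p.grad is not None:
                    fisher[n] += p.grad.detach() ** 2
            seen += 1
            if seen >= num_examples:
                break
        if seen >= num_examples:
            break
    # Average the accumulated squared gradients over the examples used.
    return {n: f / max(seen, 1) for n, f in fisher.items()}


def squisher(model, adam_optimizer):
    """Recycle Adam's second-moment accumulator ("exp_avg_sq") as a
    zero-cost stand-in for the diagonal Fisher."""
    approx = {}
    for n, p in model.named_parameters():
        state = adam_optimizer.state.get(p, {})
        if "exp_avg_sq" in state:
            approx[n] = state["exp_avg_sq"].detach().clone()
    return approx
```

Note that the two differ in ways the paper examines: the explicit estimate above uses labels sampled from the model and a plain average over fresh examples, whereas Adam's accumulator is an exponential moving average of squared training-loss gradients computed on the observed labels during training.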
Lay Summary: Understanding which parts of a neural network are most important (i.e., which parameters matter most) can help with tasks like model merging, pruning, transfer learning, and continual learning. A popular tool for this is the diagonal of the Fisher Information Matrix, which we refer to as the Fisher. But calculating it can be expensive — it requires extra computation on hundreds or thousands of examples.
In this paper, we ask whether we can get a good-enough version of the Fisher without paying the full price. Surprisingly, the answer is yes. During training, widely used optimizers like Adam already keep track of a similar quantity: a running average of the squared gradients with respect to the model's parameters. This approximation, which we call Squisher (**Squ**ared gradient accumulator as an approximation of the F**isher**), requires no extra computation or memory and is readily available “for free.”
Across five common applications of the Fisher, we show that Squisher produces results comparable to the original Fisher method, but with significantly lower computational cost. It saves time and resources, making it easier to apply these techniques at scale.
Primary Area: Deep Learning->Theory
Keywords: Fisher, optimizers, gradient accumulators
Submission Number: 7676