Keywords: optimization for deep learning, directional derivative
TL;DR: We narrow down the guessing space of gradients by using knowledge of the network architecture and the data. This is useful for backprop-free optimization of neural networks.
Abstract: How much can you say about the gradient of a neural network without computing a loss or knowing the label? This may sound like a strange question: surely the answer is "very little." However, in this paper, we show that gradients are more structured than previously thought. Gradients lie in a predictable low-dimensional subspace which depends on the network architecture and incoming features.
Exploiting this structure can significantly improve gradient-free optimization schemes based on directional derivatives, which until now have struggled to scale beyond small networks trained on MNIST. We study how to narrow the gap in optimization performance between methods that calculate exact gradients and those that use directional derivatives, demonstrate new phenomena that occur when using these methods, and highlight new challenges in scaling these methods.
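As a toy illustration of the idea (not the paper's method), consider a single linear layer y = Wx with a squared-error loss. Its weight gradient is the outer product of the output error and the input x, so every row of the gradient is a multiple of x. The sketch below compares two guess directions for a directional-derivative estimate: a fully random matrix, and a random vector times x, which uses only the incoming features. The setup, the loss, the finite-difference step, and all function names are assumptions made for this example.

```python
import math
import random


def loss(W, x, t):
    """Squared-error loss 0.5 * ||W x - t||^2 for one linear layer."""
    total = 0.0
    for row, t_i in zip(W, t):
        y_i = sum(w * x_j for w, x_j in zip(row, x))
        total += 0.5 * (y_i - t_i) ** 2
    return total


def directional_derivative(W, V, x, t, h=1e-3):
    """Central finite difference of the loss along direction V (no backprop)."""
    W_plus = [[w + h * v for w, v in zip(rw, rv)] for rw, rv in zip(W, V)]
    W_minus = [[w - h * v for w, v in zip(rw, rv)] for rw, rv in zip(W, V)]
    return (loss(W_plus, x, t) - loss(W_minus, x, t)) / (2 * h)


def cosine(A, B):
    """Cosine similarity between two matrices viewed as flat vectors."""
    dot = sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))
    norm_a = math.sqrt(sum(a * a for ra in A for a in ra))
    norm_b = math.sqrt(sum(b * b for rb in B for b in rb))
    return dot / (norm_a * norm_b)


def compare_guesses(trials=200, m=4, n=32, seed=0):
    """Return mean cosine similarity to the true gradient for both guess types."""
    rng = random.Random(seed)
    W = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
    x = [rng.gauss(0, 1) for _ in range(n)]
    t = [rng.gauss(0, 1) for _ in range(m)]

    # Exact gradient (error times x^T), used here only to score the guesses.
    delta = [sum(w * x_j for w, x_j in zip(row, x)) - t_i
             for row, t_i in zip(W, t)]
    G = [[d * x_j for x_j in x] for d in delta]

    full_total, structured_total = 0.0, 0.0
    for _ in range(trials):
        # Guess 1: random direction in the full m*n weight space.
        V = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
        d = directional_derivative(W, V, x, t)
        full_total += cosine(G, [[d * v for v in row] for row in V])

        # Guess 2: random m-vector times x^T, built without the label.
        eps = [rng.gauss(0, 1) for _ in range(m)]
        U = [[e * x_j for x_j in x] for e in eps]
        d = directional_derivative(W, U, x, t)
        structured_total += cosine(G, [[d * u for u in row] for row in U])

    return full_total / trials, structured_total / trials


full_cos, structured_cos = compare_guesses()
print(f"full-space guess: {full_cos:.3f}  input-structured guess: {structured_cos:.3f}")
```

In this toy setting the structured guess only has to match an m-dimensional error vector rather than an m*n-dimensional matrix, so its estimate aligns with the true gradient much more closely on average.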
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4653