ECO grad: Error Correcting Optimization for Quasi-Gradients, a Variable Metric DFO Strategy

ICLR 2026 Conference Submission 23243 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: DFO, Derivative Free Optimization, ZO, Zeroth Order, Jacobian Vector Products, JVPs, Directional Derivatives, Evolutionary Strategies, Simulation, Optimization, Model-Free, Reinforcement Learning, Policy functions, Policy gradient, Quasi-Newton, Quasi-Gradient, NoGrad, no grad, black box, gradient interpolation, secant constraint, directional probes, BFGS, Broyden, DFP, Isotropic Distribution, Ball surface, ellipse, uniform sphere distributed, LMS filter, Least Change Update
TL;DR: We update a gradient estimator actively in a quasi-Newton manner (a quasi-gradient), reducing the dimensional dependence of potentially all DFO and no-grad problems. A feasible path to scaled-up no-grad optimization?
Abstract: We introduce a \textit{Quasi-Gradient} method using zeroth-order directional derivatives and quasi-Newton-like updates. Empirically, our method reduces the $d$-dependence of zeroth-order problems to an effective factor of $\approx d \cdot m$, where $1/d \le m \le 1$, with only a small linear increase in compute. We show this holds under Lipschitz bounds and on practical tasks. While compressive sensing achieves similar gains for sparse gradients, our approach applies to any gradient geometry. It exploits high cosine similarity and stable gradient norms across neighboring steps, ultimately requiring fewer samples to correct the estimator. Applications include policy optimization, model-free reinforcement learning, function smoothing, evolutionary methods, efficient JVPs (e.g. in JAX), learning from simulation, and related areas. We include a probing framework that leverages convergence bounds to detect when a gradient estimator is no longer aligned with new samples, helping prevent non-descent steps. We also introduce the \textit{ECO estimator}, a least-change secant update that results in a specific LMS adaptation and achieves $O(e^{-k/d})$ convergence in gradient MSE, whereas Monte Carlo averaging is sub-exponential at $O(\frac{d+1}{d+k+1})$. Finally, we provide performance results comparing directional SGD to quasi-GD, alone and with adaptive optimizers. As models grow, our approach bridges the gap between full-gradient methods and large-scale derivative-free optimization. We hope to motivate further research in quasi-gradient techniques for simulation and exploratory learning.
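The least-change secant update described in the abstract can be illustrated concretely. Below is a minimal sketch, not the authors' code: it assumes a fixed point $x$, exact directional derivatives obtained via JAX JVPs, and unit probes drawn uniformly from the sphere; the names `eco_update` and `demo` are hypothetical, and the paper's actual estimator presumably adds step-size control and the probing safeguards mentioned above.

```python
import jax
import jax.numpy as jnp

def eco_update(g, u, d_f):
    """Least-change secant correction: the smallest change to g (in the
    2-norm) satisfying the secant constraint g @ u == d_f for a unit
    probe u. Equivalent to an LMS / Kaczmarz step on the residual."""
    return g + (d_f - g @ u) * u

def demo(dim=100, steps=500, seed=0):
    key = jax.random.PRNGKey(seed)
    # Toy smooth objective; its true gradient is what we try to track.
    A = jax.random.normal(jax.random.PRNGKey(1), (dim, dim)) / jnp.sqrt(dim)
    f = lambda x: 0.5 * jnp.sum((A @ x) ** 2)
    x = jnp.ones(dim)
    g_true = jax.grad(f)(x)

    g = jnp.zeros(dim)  # running quasi-gradient estimate
    for _ in range(steps):
        key, sub = jax.random.split(key)
        u = jax.random.normal(sub, (dim,))
        u = u / jnp.linalg.norm(u)  # uniform direction on the sphere
        # Exact directional derivative via a JVP; in a truly black-box
        # setting a finite difference of two function values would do.
        _, d_f = jax.jvp(f, (x,), (u,))
        g = eco_update(g, u, d_f)

    err = jnp.linalg.norm(g - g_true) / jnp.linalg.norm(g_true)
    print(f"relative gradient error after {steps} probes: {err:.3e}")

if __name__ == "__main__":
    demo()
```

For a unit probe $u$ with $\mathbb{E}[uu^\top] = I/d$, each correction projects out the error component along $u$, so the expected squared error contracts by a factor $(1 - 1/d)$ per probe, giving $(1 - 1/d)^k \approx e^{-k/d}$ after $k$ probes, which is consistent with the abstract's stated rate for a stationary gradient.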
Supplementary Material: zip
Primary Area: optimization
Submission Number: 23243