Zeroth-Order Optimization is Secretly Single-Step Policy Optimization

Junbin Qiu; Zhengpeng Xie; Xiangda Yan; Yongjie Yang; Yao Shu

Published: 10 Jun 2025, Last Modified: 01 Jul 2025

Venue: TTODLer-FM @ ICML 2025 Poster

License: CC BY 4.0

Keywords: Zeroth-Order Optimization, Variance Reduction, Convergence

TL;DR: This paper reveals a fundamental equivalence between Zeroth-Order Optimization and Policy Optimization and introduces ZoAR, which significantly improves performance by incorporating PO-inspired variance reduction.

Abstract: Zeroth-Order Optimization (ZOO) provides powerful tools for optimizing functions where explicit gradients are unavailable or expensive to compute. However, the underlying mechanisms of popular ZOO methods, particularly those employing randomized finite differences, and their connection to other optimization paradigms like Reinforcement Learning (RL) are not fully elucidated. This paper establishes a fundamental and previously unrecognized connection: ZOO with finite differences is equivalent to a specific instance of single-step Policy Optimization (PO). We formally show that the implicitly smoothed objective function optimized by common ZOO algorithms is identical to a single-step PO objective. Furthermore, we show that widely used ZOO gradient estimators are mathematically equivalent to the REINFORCE gradient estimator with a specific baseline function, revealing the variance-reducing mechanism in ZOO from a PO perspective. Building on this unified framework, we propose ZoAR (Zeroth-Order Optimization with Averaged Baseline and Query Reuse), a novel ZOO algorithm incorporating PO-inspired variance reduction techniques: an averaged baseline from recent evaluations and query reuse analogous to experience replay. Our theoretical analysis further substantiates that these techniques reduce variance and enhance convergence. Extensive empirical studies on synthetic benchmarks, black-box adversarial attacks, and memory-efficient fine-tuning of large language models (LLMs) validate our theory and demonstrate that ZoAR significantly outperforms other methods in terms of convergence speed and final performance. Overall, our work provides a new theoretical lens for understanding ZOO and offers practical algorithmic improvements derived from its connection to PO.
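The claimed equivalence between a ZOO gradient estimator and REINFORCE with a baseline can be checked numerically in a small case. The sketch below is illustrative only and is not taken from the paper: it assumes a Gaussian forward-difference estimator, a Gaussian "policy" y ~ N(x, mu^2 I) centered at the current iterate, the objective value treated as a cost, and the baseline b = f(x). The objective `f` and the helper names `zoo_estimate` and `reinforce_estimate` are hypothetical.

```python
import random


def f(x):
    """Hypothetical black-box objective (a quadratic), used only for illustration."""
    return sum(xi * xi for xi in x)


def zoo_estimate(f, x, mu, directions):
    """Forward-difference ZOO estimate: mean over u of (f(x + mu*u) - f(x)) / mu * u."""
    fx = f(x)
    grad = [0.0] * len(x)
    for u in directions:
        y = [xi + mu * ui for xi, ui in zip(x, u)]
        scale = (f(y) - fx) / mu
        for i, ui in enumerate(u):
            grad[i] += scale * ui / len(directions)
    return grad


def reinforce_estimate(f, x, mu, directions, baseline):
    """REINFORCE estimate for the assumed policy y ~ N(x, mu^2 I) with a constant baseline."""
    grad = [0.0] * len(x)
    for u in directions:
        # Single-step "action" sampled from the policy centered at x.
        y = [xi + mu * ui for xi, ui in zip(x, u)]
        # Score function: gradient of log N(y; x, mu^2 I) with respect to x.
        score = [(yi - xi) / mu ** 2 for yi, xi in zip(y, x)]
        advantage = f(y) - baseline
        for i, si in enumerate(score):
            grad[i] += advantage * si / len(directions)
    return grad


rng = random.Random(0)
dim, num_queries, mu = 5, 200, 1e-2
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_queries)]

g_zoo = zoo_estimate(f, x, mu, directions)
g_pg = reinforce_estimate(f, x, mu, directions, baseline=f(x))

# With the same samples, the two estimates agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(g_zoo, g_pg))
print(max_gap)
```

Under these assumptions the score function equals u / mu, so each REINFORCE term (f(y) - f(x)) * u / mu matches the corresponding finite-difference term exactly. Choosing a different baseline changes only the variance of the estimate, which is the lever the abstract attributes to ZoAR's averaged baseline.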

Submission Number: 17
