- Keywords: Derivative free optimization, reinforcement learning
- Abstract: Zeroth-order (ZO, also known as derivative-free) methods, which estimate a noisy gradient from the finite difference of two function evaluations, have recently attracted much attention because of their broad applications in the machine learning community. The function evaluations are typically requested at a point plus a random perturbation drawn from a (standard Gaussian) distribution. The accuracy of the noisy gradient depends heavily on how many perturbations are sampled from the distribution, which intrinsically conflicts with the efficiency of ZO algorithms. Although much effort has been made to improve the efficiency of ZO algorithms, we explore a new direction: learning an optimal sampling policy via reinforcement learning (RL) to generate perturbations instead of using a purely random strategy, which makes it possible to compute a ZO gradient with only two function evaluations. Specifically, we first formulate the problem of learning a sampling policy as a Markov decision process. Then, we propose our ZO-RL algorithm, which uses deep deterministic policy gradient, an actor-critic RL algorithm, to learn a sampling policy that guides the generation of perturbed vectors so that the resulting ZO gradients are as accurate as possible. Since our method only affects the generation of perturbed vectors, it is orthogonal to existing efforts to accelerate ZO methods, such as learning a data-driven Gaussian distribution, and we show how to combine it with other acceleration techniques to further improve the efficiency of ZO algorithms. Experimental results with different ZO estimators show that our ZO-RL algorithm can effectively reduce the query complexity of ZO algorithms, especially in the later stage of the optimization process, and converges faster than existing ZO algorithms.
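To make the setup concrete, the following is a minimal sketch of the standard two-point ZO gradient estimator the abstract refers to: a Gaussian direction `u` is sampled, the function is evaluated at `x` and at `x + mu*u`, and the finite difference along `u` approximates the gradient. The function name, the smoothing parameter `mu`, and the averaging over `n_perturb` directions are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def zo_gradient(f, x, mu=1e-3, n_perturb=1, rng=None):
    """Two-point zeroth-order gradient estimate.

    Averages the Gaussian finite-difference estimator
        (f(x + mu*u) - f(x)) / mu * u,   u ~ N(0, I),
    over n_perturb random directions. More directions give a
    more accurate gradient but cost more function queries,
    which is the accuracy/efficiency trade-off discussed above.
    """
    rng = np.random.default_rng(rng)
    grad = np.zeros_like(x, dtype=float)
    for _ in range(n_perturb):
        u = rng.standard_normal(x.size)
        grad += (f(x + mu * u) - f(x)) / mu * u
    return grad / n_perturb

# Sanity check on a quadratic: f(x) = ||x||^2 has gradient 2x.
f = lambda x: float(x @ x)
x = np.array([1.0, 2.0, 3.0])
g = zo_gradient(f, x, n_perturb=2000, rng=0)
```

With enough sampled directions the estimate aligns closely with the true gradient `2x`; the paper's point is that a learned sampling policy aims to reach comparable accuracy with only two function evaluations per step.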