Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking

Published: 01 Jan 2023, Last Modified: 16 May 2025 · VALUETOOLS 2023 · CC BY-SA 4.0
Abstract: Evolution Strategy (ES) is a potent black-box optimization technique based on natural evolution. A key step in each ES iteration is the ranking of candidate solutions based on some fitness score. In the Reinforcement Learning (RL) context, this step entails evaluating several policies. Presently, this evaluation is done via on-policy approaches: each policy's score is estimated by interacting several times with the environment using that policy. Such approaches waste interactions since, once the ranking is done, only the data associated with the top-ranked policies are used for subsequent learning. To improve sample efficiency, we introduce a novel off-policy ranking approach using a local approximation for the fitness function. We demonstrate our idea for two leading ES methods: Augmented Random Search (ARS) and Trust Region Evolution Strategy (TRES). MuJoCo simulations show that, compared to the original methods, our off-policy variants have similar running times for reaching reward thresholds but need only around 70% as much data on average. In fact, in some tasks like HalfCheetah-v3 and Ant-v3, we need just 50% as much data. Notably, our method supports extensive parallelization, enabling our ES variants to be significantly faster than popular non-ES RL methods like TRPO, PPO, and SAC.
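To make the sample-efficiency issue concrete, below is a minimal sketch of the on-policy ranking step in a generic ARS-style ES iteration, i.e., the step the paper's off-policy ranking is designed to replace. It is an illustrative assumption, not the authors' implementation: the linear policy, the Gymnasium-style `env` interface, the helper names (`rollout`, `ars_iteration`), and all hyperparameter values are placeholders.

```python
import numpy as np

def rollout(env, theta, horizon=1000):
    """Run one episode with a linear policy a = theta @ s and return the total reward.
    Assumes a Gymnasium-style environment API."""
    s, _ = env.reset()
    total = 0.0
    for _ in range(horizon):
        a = theta @ s
        s, r, terminated, truncated, _ = env.step(a)
        total += r
        if terminated or truncated:
            break
    return total

def ars_iteration(env, theta, n_dirs=8, top_b=4, nu=0.05, alpha=0.02):
    """One ARS-style update: perturb the policy, evaluate each candidate
    on-policy, rank the directions, and update with the top-ranked ones."""
    deltas = [np.random.randn(*theta.shape) for _ in range(n_dirs)]
    # On-policy fitness evaluation: every candidate policy must interact
    # with the environment, which is where the extra samples are spent.
    r_plus = np.array([rollout(env, theta + nu * d) for d in deltas])
    r_minus = np.array([rollout(env, theta - nu * d) for d in deltas])
    # Rank directions by the better of their two returns; keep only the top b,
    # so rollouts from the lower-ranked candidates are effectively discarded.
    order = np.argsort(np.maximum(r_plus, r_minus))[::-1][:top_b]
    sigma = np.concatenate([r_plus[order], r_minus[order]]).std() + 1e-8
    step = sum((r_plus[k] - r_minus[k]) * deltas[k] for k in order)
    return theta + alpha / (top_b * sigma) * step
```

In this sketch every one of the 2 × n_dirs candidate policies needs its own rollouts purely to be ranked; the paper's contribution is to estimate that ranking off-policy from already-collected data via a local approximation of the fitness function, so fewer of these evaluation rollouts are required.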