Sample-efficient reinforcement learning for environments with rare high-reward states

Published: 01 Aug 2024, Last Modified: 09 Oct 2024 · EWRL17 · CC BY 4.0
Keywords: Fleming-Viot, Actor-Critic, policy gradient, stochastic optimisation
TL;DR: We present Fleming-Viot Actor-Critic, a sample-efficient reinforcement learning method for learning optimal policies in the presence of large, rare rewards.
Abstract: We introduce FVAC (Fleming-Viot Actor-Critic), an algorithm for the efficient learning of optimal policies in reinforcement learning problems with rare, high-reward states. FVAC uses actor-critic policy gradient, with the critic estimated via the Fleming-Viot particle system, a stochastic process originally used to model population evolution, which boosts the visit frequency of the rare states. This boosting is achieved by forcing exploration outside a set of states identified as highly visited during an initial exploration of the environment. The only requirements of the method are that learning be set under the average-reward criterion and that a black-box simulator or emulator of the environment be available. We showcase the method's performance in windy grid worlds, where a non-zero reward is observed only at a terminal cell that is difficult to reach because of the wind. Our results show that FVAC learns significantly faster than standard reinforcement learning algorithms based on Monte Carlo exploration with temporal-difference learning.
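To illustrate the frequency-boosting mechanism described in the abstract, the sketch below shows a minimal Fleming-Viot particle system on a windy grid world: whenever a particle re-enters a set of states deemed highly visited, it is immediately reinstated at the position of another surviving particle, so the empirical occupation measure concentrates on rarely visited states. This is only an illustrative sketch, not the authors' implementation; the grid size, wind strength, start region, and names such as `step` and `fleming_viot` are assumptions made for this example.

```python
# Illustrative sketch of the Fleming-Viot particle idea for boosting visits to
# rare states (all environment parameters are assumed, not from the paper).
import random
from collections import Counter

GRID_W, GRID_H = 8, 8
GOAL = (7, 7)                                   # rare, high-reward terminal cell
# Set of "highly visited" states near the start, playing the role of the
# absorption set identified during an initial exploration phase (assumed).
ABSORPTION_SET = {(x, y) for x in range(3) for y in range(3)}
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def step(state):
    """One random-policy step, with a 'wind' that pushes back toward the start."""
    dx, dy = random.choice(ACTIONS)
    if random.random() < 0.4:                   # assumed wind strength
        dx, dy = -1, -1
    x = min(max(state[0] + dx, 0), GRID_W - 1)
    y = min(max(state[1] + dy, 0), GRID_H - 1)
    return (x, y)

def fleming_viot(n_particles=50, n_steps=5000):
    """Evolve N particles; a particle entering the absorption set is instantly
    resurrected at the position of another particle, so the empirical
    occupation measure concentrates outside the frequently visited region."""
    particles = [(3, 3)] * n_particles          # start just outside the absorption set
    visits = Counter()
    for _ in range(n_steps):
        i = random.randrange(n_particles)       # update one particle at a time
        new_state = step(particles[i])
        if new_state in ABSORPTION_SET:
            # Fleming-Viot resurrection: copy the state of a surviving particle.
            new_state = particles[random.randrange(n_particles)]
        particles[i] = new_state
        visits[new_state] += 1
    return visits

if __name__ == "__main__":
    occupancy = fleming_viot()
    total = sum(occupancy.values())
    print("Estimated occupation of the goal cell:", occupancy[GOAL] / total)
```

In FVAC, an occupation estimate of this kind would feed the critic under the average-reward criterion; here it merely demonstrates how resurrection outside the highly visited set raises the visit frequency of hard-to-reach states relative to plain Monte Carlo exploration.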
Submission Number: 127