Return Capping: Sample Efficient CVaR Policy Gradient Optimisation

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: We present a sample-efficient method for policy gradient CVaR optimisation by capping trajectory returns, rather than discarding trajectories.
Abstract: When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem that caps the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in a number of environments, that this reformulation consistently improves performance compared to baselines. We have made all our code available here: https://github.com/HarryMJMead/cvar-return-capping.
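For illustration, below is a minimal sketch of how return capping might plug into a REINFORCE-style loss. This is not the authors' implementation (see the linked repository for that); the function name, the use of the empirical alpha-quantile of batch returns as the cap, and the mean baseline are all assumptions made for this example.

```python
import torch


def capped_cvar_pg_loss(log_probs, returns, alpha=0.1, cap=None):
    """Illustrative CVaR policy-gradient loss with return capping.

    log_probs: (batch,) summed log-probabilities of the actions in each trajectory
    returns:   (batch,) total return of each trajectory
    alpha:     CVaR level (fraction of worst-case outcomes of interest)
    cap:       return cap; if None, estimated here as the empirical alpha-quantile
               of the batch returns (an assumption for this sketch)
    """
    if cap is None:
        # Assumed choice: use the empirical alpha-quantile (VaR estimate) as the cap.
        cap = torch.quantile(returns, alpha)
    cap = torch.as_tensor(cap, dtype=returns.dtype, device=returns.device)

    # Cap returns instead of discarding trajectories: rollouts with return above
    # the cap still contribute gradient signal, but only up to the capped value.
    capped_returns = torch.minimum(returns, cap)

    # Standard REINFORCE-style surrogate on the capped returns, with a mean baseline.
    baseline = capped_returns.mean()
    loss = -((capped_returns - baseline).detach() * log_probs).mean()
    return loss
```

In contrast, a discard-based CVaR-PG baseline would zero out (or drop) all trajectories whose return exceeds the VaR estimate, so only roughly an alpha fraction of each batch contributes to the gradient; capping keeps every rollout in the update.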
Lay Summary: When training agents to do certain tasks, such as finding the optimal strategy for a game, the agent is generally trained to maximise its average performance. In our work we instead focus on training agents that perform well even in the worst-case scenarios. When training an agent to play a game, reinforcement learning involves repeatedly playing the game and using these repeated plays, often referred to as rollouts, to determine which actions result in the best outcomes. Current methods for maximising agent performance in the worst case rely on discarding the majority of these rollouts and learning only from the worst-performing ones. We found that, instead of discarding these rollouts, capping their performance (effectively treating them as having performed worse than they actually did) and then learning from the capped rollouts allowed agents to learn more effectively and efficiently how to perform well in the worst case.
Link To Code: https://github.com/HarryMJMead/cvar-return-capping
Primary Area: Reinforcement Learning->Deep RL
Keywords: Reinforcement Learning, Machine Learning, CVaR, Risk-Averse
Submission Number: 13664