Abstract: We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent's policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge of the transition function. To achieve this, we use a novel online mirror descent algorithm with variable constraint sets and a carefully designed exploration bonus. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent's policy. We achieve a sub-linear regret bound for this more challenging problem by adapting techniques from bandit convex optimization to the MDP setting.
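To give a concrete feel for the core update, the following is a minimal sketch of online mirror descent with the entropic mirror map (exponentiated gradient) on the probability simplex, the standard building block behind mirror-descent methods over state-action distributions. This is an illustrative toy, not the paper's algorithm: it omits the variable constraint sets, the exploration bonus, and the MDP occupancy-measure polytope, and all names here (`omd_step`, `eta`, `target`) are hypothetical.

```python
import numpy as np

def omd_step(x, grad, eta):
    """One entropic (exponentiated-gradient) mirror descent step on the simplex.

    x    : current point on the probability simplex
    grad : gradient of the convex loss at x
    eta  : step size
    """
    y = x * np.exp(-eta * grad)
    return y / y.sum()  # normalization = Bregman projection back onto the simplex

# Toy run: minimize the convex loss f(x) = ||x - target||^2 over the simplex.
d = 4
target = np.array([0.4, 0.3, 0.2, 0.1])  # minimizer, lies inside the simplex
x = np.ones(d) / d                        # uniform initialization
for t in range(200):
    grad = 2 * (x - target)               # gradient of f at the current iterate
    x = omd_step(x, grad, eta=0.5)
```

In the CURL setting, the simplex above would be replaced by the (round-dependent) polytope of occupancy measures consistent with the current transition estimates, which is what makes the constraint sets variable across episodes.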
Lay Summary: This paper explores a more general form of reinforcement learning (RL) called Convex Reinforcement Learning (CURL). Unlike traditional RL, which focuses on maximizing a reward signal over the agent's trajectory, CURL allows the agent to optimize more complex, convex objective functions based on its state distribution over time. This flexibility makes CURL applicable to a wider range of real-world problems, such as energy grid optimization, mean-field games, or multi-objective learning. However, the non-linear nature of these objectives breaks standard RL tools like the Bellman equation, requiring new algorithmic approaches. The paper introduces the first algorithm that achieves near-optimal learning performance in this setting when the objective changes over time, while requiring no prior knowledge of the environment's dynamics. The paper also addresses a more challenging version of the problem called the bandit setting, where the agent only observes the outcome of the strategy it actually used, without receiving any information about how other possible strategies would have performed.
Link To Code: https://github.com/biancammoreno/Convex_RL
Primary Area: Theory->Online Learning and Bandits
Keywords: Online learning, convex reinforcement learning, Markov decision processes, bandits
Submission Number: 6842