Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: First Policy Optimization algorithms and improved bounds for online MDPs with aggregate bandit feedback
Abstract: We study online finite-horizon Markov Decision Processes with adversarially changing losses and aggregate bandit feedback (a.k.a. full-bandit feedback). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at the intermediate steps within the trajectory. We introduce the first Policy Optimization algorithms for this setting. In the known-dynamics case, we achieve the first *optimal* regret bound of $\tilde \Theta(H^2\sqrt{SAK})$, where $K$ is the number of episodes, $H$ is the episode horizon, $S$ is the number of states, and $A$ is the number of actions. In the unknown-dynamics case, we establish a regret bound of $\tilde O(H^3 S \sqrt{AK})$, significantly improving the best known result by a factor of $H^2 S^5 A^2$.
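To make the feedback model concrete, below is a minimal sketch (not the paper's algorithm) of one episode of a finite-horizon MDP under aggregate bandit feedback. All names and the randomly generated dynamics are hypothetical illustrations; the point is only that the learner observes a single scalar, the trajectory's total loss, rather than the per-step losses.

```python
import numpy as np

# Hypothetical toy MDP: S states, A actions, horizon H.
# The learner never sees `loss` per step, only the episode's aggregate loss.
rng = np.random.default_rng(0)
S, A, H = 5, 3, 4

P = rng.dirichlet(np.ones(S), size=(H, S, A))    # transition kernel P[h, s, a] -> distribution over next states
loss = rng.uniform(size=(H, S, A))               # per-step losses (hidden from the learner)
policy = rng.dirichlet(np.ones(A), size=(H, S))  # a stochastic policy pi[h, s] -> distribution over actions

def run_episode():
    """Play one episode; return ONLY the aggregate loss of the trajectory."""
    s, total = 0, 0.0
    for h in range(H):
        a = rng.choice(A, p=policy[h, s])
        total += loss[h, s, a]        # accumulated internally, never revealed step by step
        s = rng.choice(S, p=P[h, s, a])
    return total                      # the single scalar the learner observes

print(run_episode())
```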
Lay Summary: We study the challenge of training reinforcement learning agents when feedback is only available as the total loss at the end of each episode, a situation known as aggregate bandit feedback. This is common in settings like robotics or dialogues with an LLM, where step-by-step feedback is typically unavailable. Our work introduces the first Policy Optimization algorithms for this problem. When the environment’s dynamics are known, our method achieves the first optimal regret bound (a common performance measure) for this setting. When the dynamics are unknown, our approach substantially improves regret compared to previous best results, marking a significant step forward for learning in environments with limited feedback.
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Online MDPs, Policy Optimization, Aggregate Bandit Feedback, Full-bandit feedback, Reinforcement Learning, Regret Minimization, Adversarial MDPs
Submission Number: 7918