Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward

Washim Uddin Mondal; Vaneet Aggarwal

Published: 28 Aug 2023, Last Modified: 17 Sept 2024. Accepted by TMLR.

Abstract: We investigate an infinite-horizon average reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. The delay and compositeness of rewards mean that rewards generated as a result of taking an action at a given state are fragmented into different components, and they are sequentially realized at delayed time instances. The partial anonymity attribute implies that a learner, for each state, only observes the aggregate of past reward components generated as a result of different actions taken at that state, but realized at the observation instance. We propose an algorithm named $\mathrm{DUCRL2}$ to obtain a near-optimal policy for this setting and show that it achieves a regret bound of $\tilde{\mathcal{O}}\left(DS\sqrt{AT} + d (SA)^3\right)$ where $S$ and $A$ are the sizes of the state and action spaces, respectively, $D$ is the diameter of the MDP, $d$ is a parameter upper bounded by the maximum reward delay, and $T$ denotes the time horizon. This demonstrates the optimality of the bound in the order of $T$, and an additive impact of the delay.
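The feedback model described in the abstract can be illustrated with a minimal sketch. This is not the paper's DUCRL2 algorithm; it only simulates what a learner would observe. It assumes deterministic reward components for simplicity, and the names `observe_rewards`, `trajectory`, `components`, and `horizon` are hypothetical, introduced here for illustration only.

```python
from collections import defaultdict


def observe_rewards(trajectory, components, horizon):
    """Simulate delayed, composite, partially anonymous reward feedback.

    trajectory: list of (state, action) pairs; entry t is the pair taken at time t.
    components: dict mapping (state, action) to a list of reward components,
        where components[(s, a)][tau] is realized tau steps after the action
        (assumption: deterministic components, for illustration only).
    horizon: number of time steps to record.

    Returns a list `observed` where observed[t][s] is the aggregate of all
    reward components realized at time t that originated from state s,
    regardless of which action or which earlier time step produced them.
    """
    observed = [defaultdict(float) for _ in range(horizon)]
    for t, (s, a) in enumerate(trajectory):
        for tau, r in enumerate(components[(s, a)]):
            if t + tau < horizon:
                # Partial anonymity: only the per-state sum is visible,
                # so contributions of different actions are merged.
                observed[t + tau][s] += r
    return observed


# Hypothetical two-state example.
trajectory = [("s0", "a0"), ("s0", "a1"), ("s1", "a0")]
components = {
    ("s0", "a0"): [0.5, 0.25],  # composite: two fragments, the second delayed by 1
    ("s0", "a1"): [0.0, 1.0],   # entire reward delayed by 1
    ("s1", "a0"): [0.25],       # immediate reward
}
observed = observe_rewards(trajectory, components, horizon=4)
for t, per_state in enumerate(observed):
    print(t, dict(per_state))
```

In this example the learner sees only one number per state at each step. A per-state aggregate can therefore mix fragments from different actions and different earlier visits to that state, which is the difficulty the partial anonymity attribute introduces.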

Submission Length: Regular submission (no more than 12 pages of main content)

Supplementary Material: pdf

Assigned Action Editor: ~Jiantao_Jiao1

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 1121
