TL;DR: We prove a near-optimal regret bound of $\tilde{O}(\sqrt{SAH^3K})$ for finite-horizon MDPs with trajectory feedback.
Abstract: In this work, we study reinforcement learning (RL) with trajectory feedback.
Compared to the standard RL setting, in RL with trajectory feedback the agent only observes the cumulative reward along the trajectory; this model is therefore particularly suitable for scenarios where querying the reward at each individual step incurs a prohibitive cost.
For a finite-horizon Markov Decision Process (MDP) with $S$ states, $A$ actions, and horizon length $H$, we develop an algorithm that enjoys an asymptotically nearly optimal regret of $\tilde{O}\left(\sqrt{SAH^3K}\right)$ over $K$ episodes.
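The regret measure here is presumably the standard episodic one (the notation $\pi_k$, $s_1^k$, $V_1^{\star}$, and $V_1^{\pi_k}$ below is ours for illustration): letting $\pi_k$ denote the policy played in episode $k$ and $s_1^k$ its initial state,
\[
  \mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Bigl( V_1^{\star}(s_1^k) - V_1^{\pi_k}(s_1^k) \Bigr),
\]
where $V_1^{\star}$ and $V_1^{\pi_k}$ are the optimal and the played policy's value functions at step $1$.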
To achieve this result, our new technical ingredients include
(i) constructing a tighter confidence region for the reward function by combining the RL with trajectory feedback setting with techniques from linear bandits (see the sketch below) and
(ii) constructing a reference transition model to better guide the exploration process.
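As a minimal sketch of ingredient (i), assuming the standard trajectory-feedback model in which the observed episode reward is the sum of the unknown mean rewards over the visited state-action pairs plus noise (the symbols $\phi_k$, $y_k$, $V_k$, $\hat r_k$, and $\beta_k$ are ours for illustration): writing $r \in \mathbb{R}^{SA}$ for the unknown per-pair reward vector and $\phi_k \in \mathbb{R}^{SA}$ for the visitation counts of episode $k$, the feedback is $y_k = \langle \phi_k, r \rangle + \text{noise}$, so a regularized least-squares estimate with an ellipsoidal confidence region, as in linear bandits, takes the form
\[
  V_k \;=\; \lambda I + \sum_{i=1}^{k} \phi_i \phi_i^{\top},
  \qquad
  \hat r_k \;=\; V_k^{-1} \sum_{i=1}^{k} \phi_i\, y_i,
  \qquad
  \mathcal{C}_k \;=\; \bigl\{\, r' : \lVert r' - \hat r_k \rVert_{V_k} \le \beta_k \,\bigr\},
\]
where $\beta_k$ is a confidence radius from self-normalized concentration bounds; the paper's tighter construction of this region is its contribution and is not reproduced here.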
Lay Summary: We study reinforcement learning (RL) with trajectory feedback, where the agent only observes the cumulative reward along the trajectory. This model is particularly suitable for scenarios where querying the reward at each step incurs prohibitive costs. We develop an algorithm that achieves asymptotically near-optimal regret for finite-horizon Markov Decision Processes.
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: reinforcement learning, online learning, trajectory feedback
Submission Number: 7741