The Value of Reward Lookahead in Reinforcement Learning

Nadav Merlis; Dorian Baudry; Vianney Perchet

The Value of Reward Lookahead in Reinforcement Learning

Nadav Merlis, Dorian Baudry, Vianney Perchet

Published: 17 Jun 2024, Last Modified: 05 Jul 2024FoRLaC PosterEveryoneRevisionsBibTeXCC BY 4.0

Abstract: In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards. Usually, rewards are observed only _after_ acting, and so the goal is to maximize the _expected_ cumulative reward. Yet, in many practical settings, reward information is observed in advance -- prices are observed before performing transactions; nearby traffic information is partially known; and goals are oftentimes given to agents prior to the interaction. In this work, we aim to quantifiably analyze the value of such future reward information through the lens of _competitive analysis_. In particular, we measure the ratio between the value of standard RL agents and that of agents with partial future-reward lookahead. We characterize the worst-case reward distribution and derive exact ratios for the worst-case reward expectations. Surprisingly, the resulting ratios relate to known quantities in offline RL and reward-free exploration. We further provide tight bounds for the ratio given the worst-case dynamics. Our results cover the full spectrum between observing the immediate rewards before acting to observing all the rewards before the interaction starts.

Format: Long format (up to 8 pages + refs, appendix)

Publication Status: No

Submission Number: 8

Loading