Keywords: Reinforcement Learning, Planning, Reward Lookahead, Competitive Ratio
TL;DR: The paper studies the potential increase in the value of RL problems given partial observations of the future realized rewards.
Abstract: In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards. Usually, rewards are observed only _after_ acting, and so the goal is to maximize the _expected_ cumulative reward. Yet, in many practical settings, reward information is observed in advance -- prices are observed before performing transactions; nearby traffic information is partially known; and goals are oftentimes given to agents prior to the interaction. In this work, we aim to quantify the value of such future reward information through the lens of _competitive analysis_. In particular, we measure the ratio between the value of standard RL agents and that of agents with partial future-reward lookahead. We characterize the worst-case reward distribution and derive exact ratios for the worst-case reward expectations. Surprisingly, the resulting ratios relate to known quantities in offline RL and reward-free exploration. We further provide tight bounds for the ratio given the worst-case dynamics. Our results cover the full spectrum between observing the immediate rewards before acting and observing all the rewards before the interaction starts.
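For intuition, a minimal sketch of the ratio the abstract describes, in our own notation rather than the paper's: let $V^{\star}_{\mathcal{M}}$ denote the optimal expected return of a standard RL agent in environment $\mathcal{M}$, and $V^{L}_{\mathcal{M}}$ that of an agent observing the next $L$ rewards before acting. One plausible formalization of the competitive ratio is then

$$\mathrm{CR}(L) \;=\; \inf_{\mathcal{M}} \frac{V^{\star}_{\mathcal{M}}}{V^{L}_{\mathcal{M}}} \;\in\; [0,1],$$

where $L = 1$ corresponds to observing the immediate reward before acting and $L = H$ (the full horizon) to observing all rewards before the interaction starts, matching the spectrum the abstract mentions.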
Primary Area: Reinforcement learning
Submission Number: 2972