Exploration for the Efficient Deployment of Reinforcement Learning Agents

Published: 17 Jun 2025 · Last Modified: 26 Jun 2025 · RL4RS 2025 · CC BY 4.0
Keywords: deployment efficiency, real-world reinforcement learning, uncertainty-based exploration
TL;DR: We introduce a series of stationary exploration policies for performing targeted, uncertainty-reducing exploration in real-world settings.
Abstract: Reinforcement learning (RL) provides a rich toolbox for learning sequential decision-making policies. Notably, learning solely from offline interaction data has been a highly successful modality for training real-world policies without ever interacting in the real world. However, a gap exists in this paradigm when the offline dataset does not cover all the behaviors necessary to extract optimal policies. Naively, one could pre-train a policy using offline RL and fine-tune it using online RL, but this can lead to catastrophe in safety-critical settings, such as healthcare and autonomous driving, where deploying an unverified policy is irresponsible. Deployment-efficient learning is a potential solution, in which the number of distinct data-collection policies is small relative to the number of policy updates. We argue that safely improving a dataset requires a deployment-efficient algorithm with a carefully constructed data-collection policy. We introduce a framework with a stationary exploration policy that aims to reduce out-of-distribution uncertainty while maintaining strong returns. We establish theoretical guarantees for this exploration framework without fine-tuning and demonstrate our methods on a small-scale toy environment and a large-scale supply chain environment with real-world data.
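To make the idea of an exploration policy that trades return against out-of-distribution uncertainty concrete, below is a minimal sketch, not the paper's actual algorithm: it assumes an ensemble of offline-trained Q-functions and uses ensemble disagreement as an epistemic-uncertainty proxy. The function name, interface, and the additive uncertainty bonus with weight `beta` are all illustrative assumptions.

```python
import numpy as np

def exploration_action(q_ensemble, state, actions, beta=0.5):
    """Hypothetical illustration: pick an action trading off estimated return
    against epistemic uncertainty.

    q_ensemble: list of callables q(state, action) -> float (assumed interface)
    beta: weight on the uncertainty bonus (illustrative hyperparameter)
    """
    scores = []
    for a in actions:
        qs = np.array([q(state, a) for q in q_ensemble])
        mean_q = qs.mean()        # estimated return under the offline models
        disagreement = qs.std()   # high disagreement ~ poorly covered by the dataset
        scores.append(mean_q + beta * disagreement)
    # A fixed (stationary) policy: the same scoring rule is applied at every state,
    # steering data collection toward actions that reduce dataset uncertainty
    # while the mean-Q term keeps returns from collapsing.
    return actions[int(np.argmax(scores))]
```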
Submission Number: 14