Exploration for Deployment-Efficient Reinforcement Learning Agents

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Real-World Exploration, Deployment Efficient RL, Real-World RL
TL;DR: We present a novel exploration framework for offline reinforcement learning agents deployed in the real world, which reduces uncertainty through a carefully constructed data-collection policy.
Abstract: Reinforcement learning (RL) provides a rich toolbox with which to learn sequential decision-making policies. Notably, the ability to learn solely from offline interaction data has been a highly successful modality for training real-world policies. However, a gap exists in this paradigm when the offline dataset does not cover all the behaviors necessary to extract optimal policies. Naively, one could pre-train a policy with offline RL and then fine-tune it with online RL, but this can be catastrophic in settings like healthcare and autonomous driving, where deploying an unverified policy is irresponsible. Deployment-efficient learning is a potential solution, in which the number of distinct data-collection policies is small relative to the number of updates to the policy. We argue that safely improving a dataset requires a deployment-efficient algorithm with a carefully constructed data-collection policy. We introduce a framework with a stationary exploration policy that aims to reduce out-of-distribution uncertainty while maintaining strong returns. We establish theoretical guarantees for this exploration framework without fine-tuning, and demonstrate our method on a large-scale supply-chain environment with real-world data.
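To make the abstract's core idea concrete, below is a minimal sketch of one plausible instantiation of a deployment-efficient loop with a stationary, uncertainty-aware data-collection policy. The paper's actual construction is not specified on this page; everything here is an assumption for illustration. In particular, the ensemble-disagreement proxy for out-of-distribution uncertainty, the `beta` trade-off weight, and the names `q_ensemble` and `exploration_policy` are hypothetical, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a small discrete MDP with N_STATES states and
# N_ACTIONS actions, and an ensemble of N_MEMBERS Q-functions.
N_STATES, N_ACTIONS, N_MEMBERS = 10, 4, 5

# Stand-in for an ensemble of Q-functions fit with offline RL on the
# current dataset (random values here in place of learned estimates).
q_ensemble = rng.normal(size=(N_MEMBERS, N_STATES, N_ACTIONS))

def exploration_policy(state: int, beta: float = 1.0) -> int:
    """Pick an action trading off estimated return against epistemic
    (out-of-distribution) uncertainty.

    Uncertainty is proxied by disagreement (std) across ensemble
    members; beta weights how strongly the policy seeks out actions
    whose value is uncertain, so the next dataset covers them.
    """
    q_mean = q_ensemble[:, state, :].mean(axis=0)  # return estimate
    q_std = q_ensemble[:, state, :].std(axis=0)    # OOD-uncertainty proxy
    return int(np.argmax(q_mean + beta * q_std))

# Deployment-efficient loop: one fixed (stationary) collection policy
# per deployment, with many offline updates between deployments.
for deployment in range(3):
    batch = [(s, exploration_policy(s)) for s in range(N_STATES)]
    # ... roll out `batch` actions, append transitions to the dataset,
    # and re-run offline RL to refresh q_ensemble before redeploying ...
    print(f"deployment {deployment}: collected {len(batch)} transitions")
```

The key design point this sketch tries to capture is that the collection policy is fixed within each deployment and only revised after an offline retraining step, matching the deployment-efficiency constraint described in the abstract.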
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 24176