Bootstrap Ensemble Uncertainty for State-Adaptive Regularization in Offline Reinforcement Learning

Published: 23 Sept 2025, Last Modified: 01 Dec 2025, ARLET, CC BY 4.0
Track: Research Track
Keywords: Offline Reinforcement Learning
Abstract: Offline reinforcement learning (RL) aims to learn an effective policy from a static dataset. A central challenge, however, is handling out-of-distribution (OOD) actions that are poorly represented in the data: OOD actions lead to unreliable value estimates and subsequent policy degradation. Traditional approaches address this through regularization, but they apply a single fixed regularization coefficient uniformly across all states. This creates a trade-off. In data-rich regions where the model is confident, excessive regularization unnecessarily constrains learning, shifting the policy towards behavior cloning and limiting policy improvement; in data-sparse regions where the model is uncertain, insufficient regularization fails to prevent value overestimation, leading to extrapolation errors. We propose confidence-based uncertainty regularization enhancement (CURE), which addresses this trade-off with adaptive regularization. CURE ties regularization strength directly to the model’s confidence about each state-action pair: it quantifies this confidence using disagreement within a bootstrap ensemble of critics as a measure of epistemic uncertainty, then feeds this uncertainty to a learned network that outputs state-specific regularization coefficients. This yields stronger regularization in uncertain regions and more optimistic learning where the data supports it. When integrated with established methods from both the value-regularization and policy-constraint paradigms of offline RL, CURE improves over the corresponding baselines on the D4RL benchmark, especially on tasks featuring mixed-quality data.
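To make the described mechanism concrete, below is a minimal, hypothetical sketch of the uncertainty-to-coefficient pipeline in PyTorch. It is not the paper's implementation: the names (Critic, CoefficientNet, adaptive_coefficient), the use of the Q-value standard deviation as the disagreement measure, and the Softplus output head are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's code): map bootstrap-ensemble
# disagreement to a positive, state-specific regularization coefficient.
import torch
import torch.nn as nn


class Critic(nn.Module):
    """A small Q-network: Q(s, a) -> scalar."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))


class CoefficientNet(nn.Module):
    """Learned map from an uncertainty estimate to a coefficient alpha > 0."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # keep the coefficient positive
        )

    def forward(self, uncertainty: torch.Tensor) -> torch.Tensor:
        return self.net(uncertainty)


def adaptive_coefficient(critics, coef_net, state, action):
    """Ensemble disagreement (std of Q-values) -> per-sample coefficient."""
    q_values = torch.stack([c(state, action) for c in critics], dim=0)  # (K, B, 1)
    disagreement = q_values.std(dim=0)   # epistemic-uncertainty proxy, (B, 1)
    return coef_net(disagreement)        # larger alpha where critics disagree


if __name__ == "__main__":
    # Usage example with random tensors (dimensions chosen arbitrarily).
    state_dim, action_dim, batch = 17, 6, 32
    critics = [Critic(state_dim, action_dim) for _ in range(5)]
    coef_net = CoefficientNet()
    s = torch.randn(batch, state_dim)
    a = torch.randn(batch, action_dim)
    alpha = adaptive_coefficient(critics, coef_net, s, a)
    print(alpha.shape)  # torch.Size([32, 1])
```

In such a setup, the resulting per-sample coefficient would scale the regularization term of the base offline RL objective (value regularization or policy constraint), so confident regions are regularized lightly and uncertain regions heavily.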
Submission Number: 104