Conservative Query and Adaptive Regularization for Offline RL Under Uncertainty Estimation

Published: 07 Sept 2025, Last Modified: 19 Dec 2025 | ECAI 2025 | Everyone | CC BY-NC 4.0
Abstract: Offline reinforcement learning (RL) aims to learn an effective policy from a static dataset, but the achievable performance is fundamentally limited by the coverage of that dataset. The action preference query mechanism leverages expert feedback without requiring environment interaction, enabling performance improvements during offline training while avoiding the cost and risk of online fine-tuning. However, existing methods still face significant challenges, both in designing helpful query strategies and in efficiently exploiting the collected preferences. Current approaches typically select queries based solely on the distance between policy actions and dataset actions, and apply naive constraints that compel the policy to stay close to the queried preferences. Such strategies often lead to unstable and inefficient policy updates and are difficult to integrate with value regularization methods. To address these issues, we propose conservative query and adaptive regularization under uncertainty estimation, a novel and lightweight framework that jointly tackles the challenges of both preference querying and preference exploitation. Specifically, we first employ a Morse neural network to quantify the uncertainty of a given action relative to the dataset. To facilitate helpful queries, we introduce an uncertainty-driven conservative query mechanism that uses these uncertainty estimates to selectively query actions close to the dataset, preserving the stability of Bellman updates. For more effective preference exploitation, we propose uncertainty-aware adaptive regularization, which dynamically modulates the strength of data-level constraints based on the uncertainty of policy actions, allowing the policy to benefit from reliable Bellman updates. We integrate our framework with CQL and perform extensive experiments on the D4RL benchmark. The results demonstrate that our method achieves superior or competitive performance across various tasks.
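The abstract describes two uncertainty-driven components: a conservative query rule and an adaptive data-level constraint. The sketch below is a minimal illustration of that idea under stated assumptions, not the authors' implementation: `morse_uncertainty` is a random-projection placeholder for a trained Morse network, and the dimensions, `threshold`, and `beta` are hypothetical values chosen only for the example.

```python
import torch

# Toy dimensions and a fixed random projection as a stand-in for a trained
# Morse network; none of these values come from the paper.
STATE_DIM, ACTION_DIM = 17, 6
_PROJ = torch.randn(STATE_DIM + ACTION_DIM, 1)

def morse_uncertainty(state, action):
    """Stand-in uncertainty score in [0, 1]: low values mean the (state, action)
    pair resembles the dataset, high values mean it lies far from it."""
    x = torch.cat([state, action], dim=-1)
    return torch.sigmoid(x @ _PROJ).squeeze(-1)

def should_query(state, policy_action, threshold=0.3):
    """Conservative query: only request an expert preference when the policy
    action is near the dataset, so the queried action keeps Bellman updates stable."""
    return morse_uncertainty(state, policy_action) < threshold

def adaptive_regularization(state, policy_action, dataset_action, beta=1.0):
    """Uncertainty-aware data-level constraint: a behavior-cloning-style penalty
    scaled by the uncertainty of the policy action, so in-distribution actions
    are constrained less and out-of-distribution actions more."""
    u = morse_uncertainty(state, policy_action)
    bc_penalty = ((policy_action - dataset_action) ** 2).sum(dim=-1)
    return (beta * u * bc_penalty).mean()

# Toy usage: pick which samples to send to the expert and compute the adaptive
# penalty that would be added to a CQL-style actor loss.
state = torch.randn(8, STATE_DIM)
pi_action = torch.randn(8, ACTION_DIM)
data_action = torch.randn(8, ACTION_DIM)
query_mask = should_query(state, pi_action)
penalty = adaptive_regularization(state, pi_action, data_action)
```

In the framework described above, these signals would be combined with a CQL critic; the sketch only indicates where the uncertainty estimate enters the query decision and the policy penalty.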