Duality-based Residual Estimation for Fully Offline Value-based Reinforcement Learning

Published: 03 Feb 2026, Last Modified: 23 Apr 2026. Venue: AISTATS 2026 Poster. Readers: Everyone. License: CC BY 4.0
TL;DR: We propose an offline validation metric for Q-function estimators with a self-tuning mechanism for hyperparameters.
Abstract: Value-based reinforcement learning (RL) efficiently handles high-dimensional state spaces, but existing approaches lack a principled method for tuning hyperparameters without online interaction, which limits their use in safety-critical and data-scarce domains. We propose the **Duality-based Residual Estimator (DRE)**, a simple offline validation metric for value-based offline RL. DRE is compatible with standard value-based Off-Policy Evaluation (OPE) and enables automatic hyperparameter selection, which we formalize through an adaptive extension of the Probably Approximately Correct (PAC) guarantee for Q-function selection. Our results address a key theoretical bottleneck on the path to *fully offline* value-based RL, enabling deployment without extensive online tuning.
Submission Number: 677