Keywords: Offline Reinforcement Learning, Diffusion Model, Flow Matching, Reparameterized Policy Gradient
TL;DR: Learn the optimized action; do not learn to optimize the action.
Abstract: Recent advancements in offline reinforcement learning have leveraged two key innovations: policy extraction from the behavior-regularized actor-critic (BRAC) objective and expressive policies, such as diffusion and flow models. However, backpropagation through iterative sampling chains is computationally demanding and often requires policy-specific solutions and careful hyperparameter tuning. We observe that the reparameterized policy gradient of the BRAC objective approximately trains the policy to replicate an 'optimal' action. Building on this insight, we introduce \textbf{Direct Optimal Action Learning (DOAL)}, an efficient, effective, and versatile framework for policy extraction from Q-value functions. DOAL uses efficient behavior losses native to the policy's distribution (e.g., the flow matching loss) to imitate an action optimized against the Q-values. Furthermore, we demonstrate that the traditional balancing factor between the Q-loss and the behavior loss can be reinterpreted as a mechanism for selecting a trust region for the optimal action. This trust-region reinterpretation yields a \textbf{Batch-Normalizing Optimizer}, which simplifies the hyperparameter search and makes hyperparameters shareable across policies. Our DOAL framework can be easily integrated with any existing Q-value-based offline RL method. We apply DOAL to Gaussian, Diffusion, and Flow policies. For Diffusion and Flow policies, our baseline models use MaxQ action sampling, in which the \textbf{number of samples} is tuned for each task. In particular, with regularized Q-value estimation, flow policies achieve the best results. On 9 OGBench tasks, our baseline models outperform the previous best models, and DOAL improves over these strong baselines while simplifying the hyperparameter search. On 6 Adroit tasks from D4RL, DOAL improves performance when Q-value learning is regularized. The code is available through \href{https://anonymous.4open.science/r/iclr2026-7144}{Anonymous Github}.
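For concreteness, the sketch below illustrates the policy-extraction idea described in the abstract, under our own assumptions rather than as the authors' implementation: a dataset action is moved up the Q-value gradient within a trust region, and a flow policy is then regressed onto the resulting "optimized" action with a standard conditional flow-matching loss, so no backpropagation through the sampling chain is needed. The names `q_net`, `policy_vf`, and `step_size`, and the batch-level gradient normalization standing in for the Batch-Normalizing Optimizer, are illustrative assumptions.

```python
import torch

def doal_target_actions(q_net, obs, actions, step_size=0.1, n_steps=1):
    """Compute 'optimized' target actions by gradient ascent on Q.

    The batch-level normalization of the Q-gradient is one plausible
    reading of the trust-region / Batch-Normalizing Optimizer idea;
    the exact procedure follows the paper.
    """
    a = actions.clone().requires_grad_(True)
    for _ in range(n_steps):
        q = q_net(obs, a).sum()
        (grad,) = torch.autograd.grad(q, a)
        # Normalize the ascent direction by the batch-average gradient norm
        # so `step_size` behaves like a trust-region radius.
        scale = grad.norm(dim=-1, keepdim=True).mean().clamp_min(1e-8)
        a = (a + step_size * grad / scale).detach().requires_grad_(True)
    return a.detach()

def flow_matching_loss(policy_vf, obs, target_actions):
    """Regress the flow policy onto the optimized actions with a standard
    conditional flow-matching loss (no backprop through sampling)."""
    noise = torch.randn_like(target_actions)
    t = torch.rand(target_actions.shape[0], 1)
    x_t = (1 - t) * noise + t * target_actions  # linear interpolation path
    v_target = target_actions - noise           # constant target velocity
    v_pred = policy_vf(obs, x_t, t)
    return ((v_pred - v_target) ** 2).mean()
```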
Primary Area: reinforcement learning
Submission Number: 22606