Keywords: Offline Reinforcement Learning, Diffusion Model, Flow Matching, Reparameterized Policy Gradient
TL;DR: Learn the optimized action; do not learn to optimize the action.
Abstract: Recent advancements in offline reinforcement learning have leveraged two key innovations: policy extraction from the behavior-regularized actor-critic (BRAC) objective and expressive policies, such as diffusion and flow models. However, backpropagation through iterative sampling chains is computationally demanding and often requires policy-specific solutions and careful hyperparameter tuning. We observe that the reparameterized policy gradient of the BRAC objective approximately trains the policy to replicate an 'optimal' action. Building on this insight, we introduce \textbf{Direct Optimal Action Learning (DOAL)}, an efficient, effective, and versatile framework for policy extraction from Q-value functions. DOAL uses efficient behavior losses native to the policy's distribution (e.g., the flow matching loss) to imitate an action optimized against the Q-values. Furthermore, we demonstrate that the traditional balancing factor between the Q-loss and the behavior loss can be reinterpreted as a mechanism for selecting a trust region for the optimal action. This trust-region reinterpretation yields a \textbf{Batch-Normalizing Optimizer}, which simplifies the hyperparameter search and makes hyperparameters shareable across policies. Our DOAL framework can be easily integrated with any existing Q-value-based offline RL method. We apply DOAL to Gaussian, Diffusion, and Flow policies. For Diffusion and Flow policies, our baseline models use MaxQ action sampling, in which the \textbf{number of samples} is tuned for each task. In particular, with regularized Q-value estimation, flow policies achieve the best results. On 9 OGBench tasks, our baseline models outperform the previous best models, and DOAL improves over these strong baselines while simplifying the hyperparameter search. On 6 Adroit tasks from D4RL, DOAL improves performance when Q-value learning is regularized. The code is available through \href{https://anonymous.4open.science/r/iclr2026-7144}{Anonymous Github}.
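For concreteness, the sketch below illustrates the policy-extraction idea described in the abstract, under our own assumptions rather than as the authors' implementation: a dataset action is moved up the Q-value gradient within a trust region, and a flow policy is then regressed onto the resulting "optimized" action with a standard conditional flow-matching loss, so no backpropagation through the sampling chain is needed. The names `q_net`, `policy_vf`, and `step_size`, and the batch-level gradient normalization standing in for the Batch-Normalizing Optimizer, are illustrative assumptions.

```python
import torch

def doal_target_actions(q_net, obs, actions, step_size=0.1, n_steps=1):
    """Compute 'optimized' target actions by gradient ascent on Q.

    The batch-level normalization of the Q-gradient is one plausible
    reading of the trust-region / Batch-Normalizing Optimizer idea;
    the exact procedure follows the paper.
    """
    a = actions.clone().requires_grad_(True)
    for _ in range(n_steps):
        q = q_net(obs, a).sum()
        (grad,) = torch.autograd.grad(q, a)
        # Normalize the ascent direction by the batch-average gradient norm
        # so `step_size` behaves like a trust-region radius.
        scale = grad.norm(dim=-1, keepdim=True).mean().clamp_min(1e-8)
        a = (a + step_size * grad / scale).detach().requires_grad_(True)
    return a.detach()

def flow_matching_loss(policy_vf, obs, target_actions):
    """Regress the flow policy onto the optimized actions with a standard
    conditional flow-matching loss (no backprop through sampling)."""
    noise = torch.randn_like(target_actions)
    t = torch.rand(target_actions.shape[0], 1)
    x_t = (1 - t) * noise + t * target_actions  # linear interpolation path
    v_target = target_actions - noise           # constant target velocity
    v_pred = policy_vf(obs, x_t, t)
    return ((v_pred - v_target) ** 2).mean()
```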
Primary Area: reinforcement learning
Submission Number: 22606