Revisiting Maximum Mean Discrepancy via Diffusion Behavior Policy in Offline RL: A Mode-Seeking Perspective

19 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Offline Reinforcement Learning, Maximum Mean Discrepancy, Policy Constraint
Abstract: Policy constraints are an effective way to mitigate distributional shift in offline Reinforcement Learning (RL). However, a key challenge lies in identifying the mode of the behavior distribution that corresponds to the highest return, thereby avoiding unnecessary constraints on suboptimal actions. The reverse KL divergence constraint can provide this capability, but its efficacy is limited by the fidelity of the behavior model, which is typically a Gaussian distribution. Diffusion models, while providing expressive behavior modeling, cannot be directly combined with a KL constraint because they lack an analytic probability density. In contrast, the Maximum Mean Discrepancy (MMD) constraint operates solely on samples generated by diffusion policies, prompting us to re-examine its potential. Surprisingly, our numerical studies reveal an intriguing insight: MMD exhibits strong mode-seeking behavior when applied to high-fidelity diffusion behavior policies guided by value signals. This finding corrects a misunderstanding in previous MMD-based methods and shows that their failures were primarily due to distorted behavior modeling. We further investigate the effect of value perturbation on MMD's mode-seeking behavior and, accordingly, redesign the MMD-based policy constraint for offline RL. Extensive experiments on the D4RL benchmark show that our method significantly outperforms prior MMD-based methods and achieves state-of-the-art performance.
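For concreteness, below is a minimal sketch of the kind of sample-based squared-MMD estimate with a Gaussian kernel that such constraints rely on, computed purely from action samples of the learned policy and a diffusion behavior policy. The function names, bandwidth, and the penalty weight `alpha` in the usage note are illustrative assumptions, not the paper's implementation.

```python
import torch

def mmd_squared(policy_actions, behavior_actions, sigma=10.0):
    """Biased (V-statistic) sample estimate of squared MMD with an RBF kernel.

    policy_actions:   (n, d) actions sampled from the learned policy pi(.|s)
    behavior_actions: (m, d) actions sampled from the diffusion behavior policy
    sigma: RBF kernel bandwidth (an illustrative hyperparameter)
    """
    def rbf(x, y):
        # Pairwise squared Euclidean distances -> Gaussian kernel matrix
        d2 = torch.cdist(x, y, p=2).pow(2)
        return torch.exp(-d2 / (2.0 * sigma ** 2))

    k_pp = rbf(policy_actions, policy_actions).mean()
    k_bb = rbf(behavior_actions, behavior_actions).mean()
    k_pb = rbf(policy_actions, behavior_actions).mean()
    # MMD^2 = E[k(p,p')] + E[k(b,b')] - 2 E[k(p,b)]; clamp for numerical safety
    return (k_pp + k_bb - 2.0 * k_pb).clamp(min=0.0)

# Illustrative usage: penalize the actor objective with the MMD term against
# actions drawn from a (frozen) diffusion behavior model for the same states:
# actor_loss = -q_value.mean() + alpha * mmd_squared(pi_samples, diffusion_samples)
```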
Primary Area: reinforcement learning
Submission Number: 15365