Keywords: flow matching, diffusion policy, online reinforcement learning, preference alignment, distribution tilting, reward-guided sampling, guidance, transport map, robotics
TL;DR: Decouple reward-based distribution tilting from transport in flow/diffusion policies to enable stable online alignment and consistently improve returns.
Abstract: Expressive generative models, such as diffusion and flow matching, have shown great promise in representing multimodal distributions for continuous control. However, aligning these models with dynamic reward signals via online reinforcement learning (RL) remains a formidable challenge, primarily due to intractable likelihoods and the instability of propagating gradients through long sampling chains. In this work, we introduce GoRL (Generative Online Reinforcement Learning), a framework that achieves stable *reward-guided* alignment by structurally decoupling optimization from generation. We view online improvement as *reward-guided distribution tilting* and realize it by decoupling *tilting from transport*: GoRL confines the alignment process to a tractable latent space, effectively learning a steering policy there, while delegating complex action synthesis to a conditional generative decoder. Crucially, unlike methods that steer fixed backbones, GoRL *co-evolves* the tilting and transport mechanisms on two timescales. We employ a **prior-anchored refinement** strategy that prevents collapse by forcing the transport map to progressively expand its support to cover the high-reward modes discovered by the latent policy. Empirically, GoRL demonstrates superior stability and performance in aligning flow- and diffusion-based policies, achieving episodic returns exceeding $3\times$ those of strong baselines on challenging tasks such as HopperStand.
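Note: as a point of reference for the term *reward-guided distribution tilting*, and not a formula taken from the submission, such tilting is commonly formalized (e.g., in KL-regularized RL) as reweighting a pretrained policy $\pi_{\text{pre}}$ by exponentiated reward; the symbols below ($r$, $\beta$) are illustrative assumptions:

$$\pi^{\star}(a \mid s) \;\propto\; \pi_{\text{pre}}(a \mid s)\,\exp\!\big(r(s,a)/\beta\big),$$

where $r(s,a)$ is the reward and $\beta > 0$ is a temperature controlling the strength of the tilt; GoRL's specific objective may differ.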
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 2