Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

Ofir Nabati; Haitong Ma; Aviv Rosenberg; Bo Dai; Oran Lang; Craig Boutilier; Na Li; Shie Mannor; Lior Shani; Guy Tennenholtz

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0

Keywords: Reinforcement learning, discrete diffusion

TL;DR: We show how to train discrete diffusion models as policies for reinforcement learning in large combinatorial action spaces, achieving state-of-the-art results and better efficiency than autoregressive models.

Abstract: Reinforcement learning (RL) struggles to scale to large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By leveraging policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we frame the policy update as a distributional matching problem, training the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and significantly enhances training performance. Our method achieves state-of-the-art results and superior sample efficiency across a diverse set of challenging combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems. Experiments demonstrate that our diffusion policies attain superior performance compared to other baselines.
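
The abstract does not state the form of the PMD target. As a rough illustration only, the sketch below uses the standard closed-form solution of a KL-regularized mirror descent step, in which the target is proportional to the old policy times exp(Q / lam). The function name `pmd_target`, the temperature `lam`, and the toy four-action example are assumptions made for illustration; the paper's actual objective and its diffusion training loss may differ. In a truly combinatorial action space this enumeration is infeasible, which is the motivation for fitting a generative policy to the target instead.

```python
import numpy as np

def pmd_target(pi_old, q_values, lam):
    """Hypothetical helper: KL-regularized PMD target for one state.

    Computes pi_target(a) proportional to pi_old(a) * exp(Q(a) / lam),
    the textbook closed form (assumed here, not taken from the paper).
    """
    logits = np.log(pi_old) + q_values / lam
    logits -= logits.max()          # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()  # normalize to a probability vector

# Toy example: four enumerable actions, uniform old policy.
pi_old = np.array([0.25, 0.25, 0.25, 0.25])
q_values = np.array([1.0, 0.0, 0.5, -1.0])

target = pmd_target(pi_old, q_values, lam=1.0)
near_old = pmd_target(pi_old, q_values, lam=1e6)  # strong regularization
```

A smaller `lam` moves the target closer to the greedy action, while a larger `lam` keeps it close to the old policy; a policy model would then be trained to match `target` rather than to follow a policy gradient directly.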

Primary Area: reinforcement learning

Submission Number: 19930
