Keywords: discrete diffusion, fine-tuning, reinforcement learning, reward optimization, reward alignment, adaptive decoding
TL;DR: We introduce A2D2, a unified framework that jointly fine-tunes an any-length masked diffusion model policy and its adaptive inference schedule under reward guidance.
Abstract: Masked discrete diffusion models (MDMs) offer a simple and stable likelihood-based framework for sequence generation and have recently been extended to any-length settings via token insertion. However, principled reward-guided fine-tuning for any-length discrete diffusion remains largely unexplored. We introduce Finetuning **A**ny-Length **D**iscrete **D**iffusion for **A**daptive Decoding (**A2D2**), a unified framework for reward-guided fine-tuning of any-length MDMs. A2D2 formulates generation as a controlled continuous-time Markov chain and jointly optimizes insertion and unmasking policies to learn a reward-tilted path measure without requiring target samples. We derive the Radon–Nikodym derivative for the joint insertion–unmasking process and introduce the Adaptive Joint Decoding (AJD) loss, which provably minimizes trajectory-induced error while preserving the target distribution. Empirically, A2D2 improves reward optimization, generation accuracy, and flexibility over prior fixed-length and inference-time guidance methods.
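For context, the "reward-tilted path measure" the abstract optimizes toward is, in standard reward-guided diffusion fine-tuning, a pretrained distribution reweighted by an exponentiated reward; a common form (notation ours, not taken from the submission) is:

```latex
% Standard reward-tilted target for RL fine-tuning of a generative model:
% p_pre is the pretrained MDM, r a reward function, alpha a temperature,
% and Z the normalizing constant. (Assumed notation, for illustration only.)
p^{*}(x) \;=\; \frac{1}{Z}\, p_{\mathrm{pre}}(x)\, \exp\!\big(r(x)/\alpha\big),
\qquad
Z \;=\; \sum_{x} p_{\mathrm{pre}}(x)\, \exp\!\big(r(x)/\alpha\big).
```

Under this reading, the Radon–Nikodym derivative the abstract derives compares the path measure of the controlled insertion–unmasking process against the pretrained one, so the tilt can be learned without samples from $p^{*}$.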
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 40