Keywords: offline reinforcement learning, diffusion models, sequential decision-making
TL;DR: Diffusion-based offline RL fails not because of poor value estimation, but because the proposal distribution lacks coverage of solution-relevant behaviors.
Abstract: Proposal-selection frameworks for offline reinforcement learning (RL) enable decision-making through candidate generation and value-based selection. However, it remains unclear where performance variations arise within this process, particularly across intermediate steps. We study a fundamental limitation of proposal-based decision-making: the extent to which decision quality is constrained by the support of the proposal distribution. Using ARC-AGI as a controlled analysis environment, we construct Synthesized Offline Learning data for Abstraction and Reasoning (SOLAR), a trajectory-based dataset that converts tasks into step-by-step solution trajectories. This enables step-level analysis of decision-making across controlled trajectory distributions. Through experiments with Latent Diffusion-Constrained Q-Learning (LDCQ), we find that decision quality is closely tied to proposal coverage: while the selection mechanism remains reliable when suitable candidates are present, performance degrades sharply when the proposal distribution fails to cover solution-relevant behaviors. These results identify proposal coverage as a key bottleneck in decision-making with diffusion-based offline RL, pointing to directions for improving robustness and generalization.
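The coverage argument in the abstract can be illustrated with a minimal toy sketch. This is not LDCQ and does not use SOLAR or ARC-AGI. It assumes a one-dimensional action space, a hypothetical target action `TARGET`, a hand-written value function `q_value`, and uniform sampling as a stand-in for a learned diffusion proposal. All names here are illustrative assumptions. The selection step is identical in both runs, so any gap in regret comes from the support of the proposal alone.

```python
import random

TARGET = 0.9  # hypothetical solution-relevant action in a toy 1-D action space


def q_value(action):
    """Toy value function (assumed): highest at TARGET, falls off with distance."""
    return -abs(action - TARGET)


def propose(low, high, n, rng):
    """Stand-in for a generative proposal: n samples drawn uniformly from [low, high]."""
    return [rng.uniform(low, high) for _ in range(n)]


def select(candidates):
    """Value-based selection: return the candidate with the highest Q-value."""
    return max(candidates, key=q_value)


def regret(action):
    """Value gap between the target action and the chosen action."""
    return q_value(TARGET) - q_value(action)


rng = random.Random(0)

# Proposal whose support contains the target action.
covering = select(propose(0.0, 1.0, 64, rng))

# Proposal whose support stops at 0.5, so the target is never generated.
truncated = select(propose(0.0, 0.5, 64, rng))

print(f"covering support:  regret = {regret(covering):.3f}")
print(f"truncated support: regret = {regret(truncated):.3f}")
```

In this sketch the truncated proposal can never achieve a regret below 0.4, no matter how many candidates are drawn or how accurate the value function is, which mirrors the claim that selection is reliable only when suitable candidates are present.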
Submission Number: 49