Feasibility-Constrained Diffusion-MPC for Discrete Combinatorial Planning: A Case Study on Tetris
Keywords: diffusion planning, model predictive control, discrete planning, MaskGIT, reinforcement learning, exploration, Tetris, decision-level regret, rollout-based control
TL;DR: Feasibility masking is the dominant mechanism in discrete diffusion-MPC; substituting a learned Q-critic for rollout reranking erases the gain (mean regret 17.6), and a bounded hybrid reranker recovers full performance.
Abstract: Diffusion-based model predictive control (Diffusion-MPC) has shown strong results in continuous robotic control, but its behaviour in discrete combinatorial domains—where a single infeasible token terminates the trajectory and rewards are sparse—remains poorly understood. We study this regime through Tetris, an NP-hard puzzle that mirrors the discrete decision structure of many robotics planning problems. We introduce DiffTetris, a MaskGIT-style discrete denoiser used as a sampling-based MPC proposal, and isolate three design questions: (i) must sampling be feasibility-constrained? (ii) can a learned $Q$-critic substitute for rollout-based reranking? and (iii) how does the planner compare against pure value-learning agents at matched compute? On 100-episode evaluations we find a single, large-effect mechanism: logit masking against the valid-placement set drives a $6.8\times$ gain in mean score and a $5.6\times$ gain in survival rate, while replacing rollout reranking with a competently trained Double-DQN critic erases the masking benefit entirely, with mean decision-level regret $17.6$ (p90 $36.6$). A bounded hybrid reranker—rollout score plus a small ($\alpha=0.05$) z-scored DQN term—fully recovers performance with near-zero regret. Contextualizing against tabular-Q, MLP/CNN Double-DQN, and Deep V-Network (DVN) baselines, we find that the planner's success traces to the same inductive bias that lets the model-based DVN dominate model-free DQN: in deterministic combinatorial domains, decision-time forward simulation is a stronger selection signal than learned bootstrapped values. We propose decision-level regret as a portable diagnostic for critic alignment in sampling-based planners.
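The three mechanisms named in the abstract (logit masking against the valid-placement set, a hybrid reranker with a z-scored critic term, and decision-level regret) can be illustrated with a minimal sketch. This is not the authors' implementation. All function names are hypothetical, and the regret definition used here (rollout score of the rollout-best candidate minus rollout score of the candidate a selector picks) is an assumption, since the abstract does not spell it out. The hybrid form follows the abstract's wording, "rollout score plus a small ($\alpha=0.05$) z-scored DQN term"; whether the rollout score is itself normalized is not stated, so it is left raw.

```python
import math
import random
import statistics


def masked_sample(logits, valid, rng):
    """Sample one action index after setting infeasible logits to -inf."""
    if not any(valid):
        raise ValueError("no feasible action")
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid)]
    m = max(masked)  # finite, because at least one action is feasible
    weights = [math.exp(l - m) for l in masked]  # exp(-inf) == 0.0
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def zscore(xs):
    """Standardize scores; return zeros when the spread is zero."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [0.0 if sd == 0 else (x - mu) / sd for x in xs]


def hybrid_scores(rollout, q_values, alpha=0.05):
    """Rollout score plus a small z-scored critic term (assumed form)."""
    return [r + alpha * z for r, z in zip(rollout, zscore(q_values))]


def decision_regret(rollout, selector):
    """Rollout score lost by taking argmax of `selector` (assumed definition)."""
    chosen = max(range(len(selector)), key=selector.__getitem__)
    return max(rollout) - rollout[chosen]


# Toy candidate set: the critic prefers the candidate that rollouts rank worst.
rollout = [10.0, 40.0, 25.0]
q_values = [3.0, 1.0, 2.0]
critic_regret = decision_regret(rollout, q_values)
hybrid_regret = decision_regret(rollout, hybrid_scores(rollout, q_values))
```

In this toy case the critic alone selects the candidate with rollout score 10, giving a regret of 30, while the hybrid score keeps the rollout ordering because the $\alpha$-weighted term is too small to overturn it. That is the sense in which the reranker is bounded: for a fixed candidate set, the critic term can shift a score by at most $\alpha$ times the largest z-score magnitude.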
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 24