TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint

ICLR 2026 Conference Submission 15901 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: reinforcement learning, model-based reinforcement learning, model predictive control, continuous control
TL;DR: A simple yet effective MBRL framework that introduces policy regularization, mitigates the value overestimation arising from policy mismatch, and achieves significant performance gains on continuous control tasks.
Abstract: Model-based reinforcement learning (MBRL) algorithms that integrate model predictive control with learned value or policy priors have shown great potential to solve complex continuous control problems. However, existing practice relies on online planning to collect high-quality data, so value learning depends entirely on off-policy experiences. Contrary to the belief that value learned from model-free policy iteration within this framework is sufficiently accurate and expressive, we find that severe value overestimation bias occurs, especially in high-dimensional tasks. Through both theoretical analysis and empirical evaluation, we identify that this overestimation stems from a structural policy mismatch: the divergence between the exploration policy induced by the model-based planner and the exploitation policy evaluated by the value prior. To improve value learning, we emphasize conservatism that mitigates out-of-distribution value queries. The proposed method, TD-M(PC)$^2$, addresses this by applying a soft-constrained policy update—a minimalist yet effective solution that can be seamlessly integrated into the existing plan-based MBRL pipeline without incurring additional computational overhead. Extensive experiments demonstrate that the proposed approach improves performance over baselines by large margins, particularly on 61-DoF humanoid control tasks.
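The abstract does not specify the exact form of the soft-constrained policy update. As a rough illustration only, one common way to realize such a constraint is to augment the usual actor objective (maximize the critic's value) with a penalty that keeps the learned policy close to the actions produced by the MPC planner, so the critic is queried nearer to the data-collecting distribution. The sketch below is a minimal, hypothetical example of this idea; the names `GaussianPolicy`, `soft_constrained_policy_loss`, the log-likelihood penalty, and the weight `bc_coef` are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): a soft-constrained policy update that
# regularizes the learned policy toward actions logged by the MPC planner,
# mitigating out-of-distribution value queries. Shapes and coefficients are
# illustrative assumptions.
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Simple diagonal-Gaussian policy head over observations."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        h = self.net(obs)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())


def soft_constrained_policy_loss(policy, q_fn, obs, planner_actions, bc_coef=1.0):
    """Maximize Q under the learned policy while softly constraining it to stay
    close to the planner's (exploration) actions via a log-likelihood penalty."""
    dist = policy.dist(obs)
    actions = dist.rsample()                                   # reparameterized sample
    q_term = -q_fn(obs, actions).mean()                        # standard actor objective
    bc_term = -dist.log_prob(planner_actions).sum(-1).mean()   # soft policy constraint
    return q_term + bc_coef * bc_term


# Toy usage: obs_dim=8, act_dim=2; q_fn is any critic callable (here a small MLP).
if __name__ == "__main__":
    obs_dim, act_dim, batch = 8, 2, 32
    policy = GaussianPolicy(obs_dim, act_dim)
    critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))
    q_fn = lambda o, a: critic(torch.cat([o, a], dim=-1))
    obs = torch.randn(batch, obs_dim)
    planner_actions = torch.randn(batch, act_dim)  # stand-in for actions from the planner
    loss = soft_constrained_policy_loss(policy, q_fn, obs, planner_actions)
    loss.backward()
    print(float(loss))
```

Setting `bc_coef = 0` recovers an unconstrained actor update; larger values pull the policy toward the planner's action distribution, trading some exploitation for better-calibrated value targets.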
Primary Area: reinforcement learning
Submission Number: 15901