Abstract: Behavioral cloning trains policies to mimic expert demonstrations, but it provides no mechanism for recovery when the agent deviates from the training distribution at test time. We address this limitation through a new paradigm: test-time planning. Our approach learns a latent world model and a reward model from expert demonstrations, then uses these components at inference to search for corrective actions when the base policy begins to fail. Concretely, we combine a hierarchical diffusion policy trained via imitation learning with Model-Predictive Control (MPC) in the learned latent space, enabling the ego vehicle to plan recovery trajectories during inference without additional human supervision. We evaluate our method on the nuPlan and CARLA planning benchmarks, demonstrating that our test-time planning approach consistently recovers from distribution shifts that cause the base policy to fail. Our results suggest that integrating search-based planning with learned world models provides a robust framework for handling inference-time distribution shifts in embodied agents.
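To make the test-time planning loop concrete, below is a minimal sketch of MPC in a learned latent space. The abstract does not specify the search procedure, so this sketch assumes simple random-shooting MPC over action perturbations of the base policy's plan; the names `world_model`, `reward_model`, and `mpc_plan` are hypothetical stand-ins (here toy linear models), not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the learned components. In the paper these
# would be a latent world model and reward model trained on expert data.
LATENT_DIM, ACTION_DIM = 8, 2
A = np.eye(LATENT_DIM) + rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM))
B = rng.normal(scale=0.1, size=(LATENT_DIM, ACTION_DIM))
goal = rng.normal(size=LATENT_DIM)

def world_model(z, a):
    """Predict the next latent state (toy linear dynamics for illustration)."""
    return z @ A.T + a @ B.T

def reward_model(z):
    """Score latent states (toy reward: negative distance to a goal latent)."""
    return -np.linalg.norm(z - goal, axis=-1)

def mpc_plan(z0, base_actions, n_samples=256, noise=0.3):
    """Random-shooting MPC: perturb the base policy's proposed action
    sequence, roll each candidate through the world model, and return
    the first action of the highest-return trajectory."""
    horizon = base_actions.shape[0]
    # Candidate plans = base plan + Gaussian exploration noise.
    plans = base_actions[None] + noise * rng.normal(
        size=(n_samples, horizon, ACTION_DIM))
    z = np.repeat(z0[None], n_samples, axis=0)
    returns = np.zeros(n_samples)
    for t in range(horizon):
        z = world_model(z, plans[:, t])
        returns += reward_model(z)
    # Receding horizon: execute only the first action, then replan.
    return plans[np.argmax(returns), 0]

z0 = rng.normal(size=LATENT_DIM)
base_plan = np.zeros((10, ACTION_DIM))  # e.g. the diffusion policy's proposal
print("corrective action:", mpc_plan(z0, base_plan))
```

In this receding-horizon scheme, only the first action of the best-scoring candidate is executed before replanning, which is what lets the search apply corrections as the agent drifts off-distribution.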
Submission Number: 44