STITCH-OPE: Trajectory Stitching with Guided Diffusion for Off-Policy Evaluation

Hossein Goli; Michael Gimelfarb; Nathan Samuel de Lara; Masha Itkina; Haruki Nishimura; Florian Shkurti

Published: 12 Jun 2025, Last Modified: 21 Jun 2025

Venue: RobotEvaluation@RSS 2025 Poster

License: CC BY 4.0

Type: An evaluation-centric paper (focused on advancing methods and ideas for robot evaluation)

Keywords: Off Policy Evaluation, Reinforcement Learning, Generative Models

TL;DR: We introduce STITCH-OPE, a guided-diffusion framework for off-policy evaluation that stitches short behavior-conditioned sub-trajectories, uses negative-behavior guidance to correct distribution shift, and outperforms baselines across all metrics.

Abstract: Off-policy evaluation (OPE) estimates the performance of a target policy using offline data collected from a behavior policy, and is crucial in domains such as robotics or healthcare where direct interaction with the environment is costly or unsafe. Existing OPE methods are ineffective for high-dimensional, long-horizon problems, due to exponential blow-ups in variance from importance weighting or compounding errors from learned dynamics models. To address these challenges, we propose STITCH-OPE, a model-based generative framework that leverages denoising diffusion for long-horizon OPE in high-dimensional state and action spaces. Starting with a diffusion model pre-trained on the behavior data, STITCH-OPE generates synthetic trajectories from the target policy by guiding the denoising process using the score function of the target policy. STITCH-OPE introduces two technical innovations that make it advantageous for OPE: (1) it prevents over-regularization by subtracting the score of the behavior policy during guidance, and (2) it generates long-horizon trajectories by stitching partial trajectories together end-to-end. We provide a theoretical guarantee that, under mild assumptions, these modifications result in an exponential reduction in variance versus long-horizon trajectory diffusion. Experiments on the D4RL and OpenAI Gym benchmarks show substantial improvement in mean squared error, correlation, and regret metrics compared to state-of-the-art OPE methods.
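Illustration: the two ideas named in the abstract, negative-behavior guidance and end-to-end stitching, can be sketched in a few lines of Python. This is a toy sketch under stated assumptions, not the authors' implementation: the policies are assumed to be one-dimensional Gaussians with mean `gain * state`, the score is taken with respect to the action only, and the names `gaussian_policy_score`, `guidance`, `stitch`, `sample_chunk`, and the weight `lam` are hypothetical. In an actual system the chunk sampler would presumably be the guided diffusion model rather than the placeholder used here.

```python
def gaussian_policy_score(action, state, gain, sigma):
    """Gradient w.r.t. the action of log N(action; gain * state, sigma^2).

    Assumption: a 1-D linear-Gaussian policy, chosen only because its
    score has a closed form.
    """
    return -(action - gain * state) / sigma ** 2


def guidance(action, state, target, behavior, lam=1.0):
    """Target-policy score minus lam times the behavior-policy score.

    `target` and `behavior` are (gain, sigma) pairs. `lam` is a
    hypothetical weight on the subtracted behavior score.
    """
    target_score = gaussian_policy_score(action, state, *target)
    behavior_score = gaussian_policy_score(action, state, *behavior)
    return target_score - lam * behavior_score


def stitch(sample_chunk, start_state, horizon, window):
    """Build a trajectory of `horizon` steps from chunks of `window` steps.

    Each chunk is conditioned on the last state of the previous chunk.
    `sample_chunk(state, window)` is a hypothetical callable returning the
    next `window` states; it stands in for a conditional generator.
    """
    states = [start_state]
    while len(states) - 1 < horizon:
        states.extend(sample_chunk(states[-1], window))
    # Trim any overshoot from the final chunk.
    return states[: horizon + 1]


def counting_chunk(state, window):
    """Deterministic placeholder sampler: each step adds 1 to the state."""
    return [state + k for k in range(1, window + 1)]


trajectory = stitch(counting_chunk, start_state=0.0, horizon=10, window=4)
```

When the target and behavior policies coincide and `lam` is 1, the guidance term is zero, so the sampler would fall back to the unguided behavior model. With `window` smaller than `horizon`, each generated chunk stays short even though the stitched trajectory is long.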

Submission Number: 4
