Stochastic Truncation for Multi-Step Off-Policy RL

ICLR 2026 Conference Submission 13092 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: multi-step off-policy reinforcement learning
Abstract: Multi-step off-policy reinforcement learning is crucial for reliable policy evaluation in long-horizon settings. However, extending beyond one-step temporal-difference learning remains challenging due to distribution mismatch between behavior and target policies. This mismatch becomes more severe at longer horizons, resulting in compounding bias and variance. Existing approaches generally fall into two categories: conservative methods (e.g., Retrace), which guarantee convergence but often suffer from high variance, and non-conservative methods (e.g., Peng’s $Q(\lambda)$ and integrated algorithms such as Rainbow), which often achieve strong empirical performance but lack convergence guarantees under arbitrary exploration. We identify horizon selection as a central obstacle and connect it to the mixing time of policy-induced Markov chains. Since mixing time is difficult to estimate online, we derive a practical upper bound through a coupling-based analysis to guide adaptive truncation. Building on this insight, we propose T4 (Time To Truncate Trajectory), a stochastic and adaptive truncation mechanism within the Retrace framework. We prove that T4 is non-conservative yet converges under arbitrary behavior policies, and is robust to cap-length tuning. Empirically, T4 improves both policy evaluation and control performance over strong baselines on standard RL benchmarks.
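For context, a minimal sketch of the quantities involved. Retrace (Munos et al., 2016) forms multi-step off-policy targets by weighting temporal-difference errors with clipped importance-sampling traces; stochastic truncation within this framework can be illustrated by cutting the return sum at a random horizon bounded by a cap length. The abstract does not specify T4's exact truncation rule, so the random-horizon form and the symbols $T$, $H$, and $p(\cdot \mid x_t)$ below are illustrative assumptions, not the authors' method.

```latex
% Retrace update (Munos et al., 2016): clipped importance-sampling traces
\Delta Q(x_t, a_t) \;=\; \sum_{s \ge t} \gamma^{\,s-t}
  \Big( \prod_{i=t+1}^{s} c_i \Big)\, \delta_s,
\qquad
c_i = \lambda \min\!\Big(1, \tfrac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\Big),
\qquad
\delta_s = r_s + \gamma\, \mathbb{E}_{a \sim \pi} Q(x_{s+1}, a) - Q(x_s, a_s).

% Illustrative stochastic truncation (assumption, not T4's exact rule):
% the infinite sum is replaced by a sum up to a random horizon T,
% capped by a maximum length H.
\Delta Q^{\mathrm{trunc}}(x_t, a_t) \;=\; \sum_{s = t}^{t + T - 1} \gamma^{\,s-t}
  \Big( \prod_{i=t+1}^{s} c_i \Big)\, \delta_s,
\qquad T \sim p(\cdot \mid x_t),\; T \le H.
```

In this reading, the cap length $H$ plays the role of the tuning parameter the abstract claims T4 is robust to, while the distribution of $T$ would be guided by the paper's coupling-based upper bound on mixing time.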
Primary Area: reinforcement learning
Submission Number: 13092