Efficient Multi-Step Reinforcement Learning with Expectation-Maximization Bootstrapping

07 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement Learning, Temporal-Difference Learning, Multi-step Reinforcement Learning
Abstract: Multi-step reinforcement learning (RL) improves agent performance by propagating temporal information across long time lags between actions and consequences through bootstrapping. The key challenge is how to aggregate information from different bootstrapping steps to enable fast learning while maintaining stability. Many existing multi-step RL methods (e.g., Retrace($\lambda$)) primarily focus on the bias–variance tradeoff but do not explicitly select bootstrapping steps to balance salience and stability (S\&S). We first analyze S\&S in multi-step RL and introduce a corresponding novel metric to quantify different bootstrapping steps. Viewing bootstrapping steps as latent variables, our Expectation-Maximization Bootstrapping (EMB) formulates multi-step RL as an EM procedure, alternating between an E-step, which estimates expectations under predefined posterior weights to measure the S\&S of bootstrapping steps, and an M-step, which uses these estimated expectations to guide the selection of bootstrapping steps. This yields a new return-based Bellman operator, EMB($\lambda$). We theoretically establish its convergence and optimality properties. Empirical results on the Arcade Learning Environment demonstrate that EMB($\lambda$) significantly outperforms existing multi-step RL methods in both learning efficiency and final performance, matching the performance of Retrace($\lambda$) with approximately $50\%$ fewer samples on the Atari-10 suite.
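The abstract describes an EM alternation over bootstrapping steps: an E-step that weights candidate steps and an M-step that uses those weights to pick a target. The paper's actual S\&S metric and operator are not given here, so the following is only a hypothetical sketch of that alternation: n-step returns play the role of the latent choices, and a simple agreement score (squared deviation from the current weighted target) stands in for the S\&S measure. The function names `n_step_return` and `em_bootstrap_step` are illustrative, not from the paper.

```python
import numpy as np

def n_step_return(rewards, values, n, gamma=0.99):
    """Standard n-step bootstrapped return from t=0:
    sum_{k<n} gamma^k r_k + gamma^n V(s_n)."""
    n = min(n, len(rewards))
    g = sum(gamma**k * rewards[k] for k in range(n))
    if n < len(values):
        g += gamma**n * values[n]  # bootstrap with the value estimate
    return g

def em_bootstrap_step(rewards, values, max_n=5, gamma=0.99, iters=3):
    """Hypothetical EM-style aggregation over bootstrapping steps n = 1..max_n.
    E-step: posterior weights over steps (here from agreement with the
    current target, a stand-in for the paper's S&S metric).
    M-step: re-estimate the target as the posterior-weighted return."""
    ns = range(1, max_n + 1)
    returns = np.array([n_step_return(rewards, values, n, gamma) for n in ns])
    target = returns.mean()  # initial target estimate
    for _ in range(iters):
        # E-step: softmax weights favoring returns close to the target
        score = -(returns - target) ** 2
        w = np.exp(score - score.max())
        w /= w.sum()
        # M-step: posterior-weighted target re-estimate
        target = float(w @ returns)
    return target, w
```

Under this sketch, steps whose returns agree with the consensus receive higher posterior weight, and the final target is their weighted mixture, rather than a fixed $\lambda$-weighted geometric average as in Retrace($\lambda$).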
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 2841