Mitigating the Curse of Horizon in Monte-Carlo Returns

Published: 15 May 2024, Last Modified: 14 Nov 2024, RLC 2024, CC BY 4.0
Keywords: RL algorithms, planning
Abstract: The standard framework in reinforcement learning (RL) dictates that an agent should use every observation collected from interactions with the environment when updating its value estimates. As this sequence of observations becomes longer, the agent is afflicted with the curse of horizon since the computational cost of its updates scales linearly with the length of the sequence. In this paper, we propose methods to mitigate this curse when computing value estimates with Monte-Carlo methods. This is accomplished by selecting a subsequence of observations on which the value estimates are computed. We empirically demonstrate on standard RL benchmarks that adopting an adaptive sampling scheme outperforms the default uniform sampling procedure.
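The sketch below is a minimal illustration (not the authors' implementation) of the idea described in the abstract: computing Monte-Carlo returns for an episode, then updating value estimates only at a sampled subsequence of timesteps rather than at all of them. The "adaptive" scheme shown here, which samples timesteps in proportion to the magnitude of their current value error, is purely an assumed placeholder; the paper's actual sampling criterion may differ.

```python
# Illustrative sketch: Monte-Carlo value updates on a sampled subsequence of timesteps.
# The adaptive sampling rule below is an assumption for demonstration, not the paper's method.
import numpy as np

def monte_carlo_returns(rewards, gamma=0.99):
    """Compute the discounted return G_t for every timestep of one episode."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def subsample_timesteps(returns, values, states, k, scheme="uniform", rng=None):
    """Select k of the T timesteps to update, instead of all T."""
    rng = rng or np.random.default_rng()
    T = len(returns)
    if scheme == "uniform":
        probs = np.full(T, 1.0 / T)
    else:
        # Assumed adaptive scheme: favour timesteps whose current estimates
        # deviate most from their Monte-Carlo targets.
        errors = np.abs(returns - values[states]) + 1e-8
        probs = errors / errors.sum()
    return rng.choice(T, size=min(k, T), replace=False, p=probs)

def update_values(values, states, returns, idx, lr=0.1):
    """Move tabular value estimates toward the Monte-Carlo targets at the sampled timesteps only."""
    for t in idx:
        values[states[t]] += lr * (returns[t] - values[states[t]])
    return values
```

Under these assumptions, updating only k sampled timesteps of an episode of length T reduces the per-episode update cost from O(T) to O(k), which is the computational saving the abstract refers to.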
Submission Number: 80