Abstract: Model-based reinforcement learning (MBRL) reduces the cost of
real-environment sampling by generating synthetic trajectories (called
rollouts) from a learned dynamics model. However, choosing the rollout length
involves two trade-offs: (1) Longer rollouts better preserve on-policy training
but amplify model bias, indicating the need for an intermediate
horizon to mitigate distribution shift (i.e., the gap between
on-policy and past off-policy samples). (2) A longer
model rollout may reduce value estimation bias but raise the variance
of policy gradients due to backpropagation through multiple steps,
implying another intermediate horizon for stable gradient estimates.
Crucially, these two optimal horizons may differ. To resolve this
conflict, we propose Double Horizon Model-Based Policy Optimization
(DHMBPO), which divides the rollout procedure into a long
``distribution rollout'' (DR) and a short ``training rollout'' (TR).
The long DR generates on-policy state samples to mitigate distribution
shift, while the short TR leverages differentiable
transitions to offer accurate value gradient estimation with stable
gradient updates, thereby requiring fewer updates and reducing overall
runtime. We demonstrate that the double-horizon approach effectively
balances distribution shift, model bias, and gradient instability, and
surpasses existing MBRL methods on continuous-control benchmarks in
terms of both sample efficiency and runtime.
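As a rough illustration of the procedure described above, the sketch below separates the long distribution rollout (gradient-free generation of on-policy start states) from the short differentiable training rollout that backpropagates a bootstrapped return through the model. It is a minimal sketch based solely on this abstract: the network architectures, horizon values (H_DR, H_TR), reward function, and update loop are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a double-horizon rollout (assumptions throughout; not the
# authors' code). DR: long, gradient-free rollout for start-state coverage.
# TR: short, differentiable rollout for low-variance value-gradient updates.
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2
H_DR, H_TR = 20, 3   # assumed horizons: long distribution rollout, short training rollout
gamma = 0.99

# Placeholder learned components (architectures are arbitrary choices).
dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                         nn.Linear(64, state_dim))
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                       nn.Linear(64, action_dim))
value = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                      nn.Linear(64, 1))
reward = lambda s, a: -(s.pow(2).sum(-1) + 0.1 * a.pow(2).sum(-1))  # toy reward

def distribution_rollout(start_states):
    """Long model rollout without gradients: yields on-policy state samples
    that reduce the gap between replay-buffer states and the current policy."""
    states, s = [], start_states
    with torch.no_grad():
        for _ in range(H_DR):
            a = policy(s)
            s = dynamics(torch.cat([s, a], dim=-1))
            states.append(s)
    return torch.cat(states, dim=0)

def training_rollout_loss(start_states):
    """Short differentiable rollout: backpropagates the return through the
    model for only H_TR steps, then bootstraps with the value function."""
    s = start_states
    ret = torch.zeros(s.shape[0])
    for t in range(H_TR):
        a = policy(s)
        ret = ret + (gamma ** t) * reward(s, a)
        s = dynamics(torch.cat([s, a], dim=-1))
    ret = ret + (gamma ** H_TR) * value(s).squeeze(-1)
    return -ret.mean()  # maximize the model-based return estimate

# One illustrative policy update from a batch of real states.
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
real_states = torch.randn(8, state_dim)        # stand-in for replay-buffer states
dr_states = distribution_rollout(real_states)  # DR: on-policy start-state distribution
loss = training_rollout_loss(dr_states)        # TR: differentiable objective
opt.zero_grad(); loss.backward(); opt.step()
```

Because the DR runs under `torch.no_grad()`, gradients flow only through the short TR, which is the mechanism the abstract credits with keeping policy-gradient variance low while the DR handles distribution shift.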
Submission Length: Regular submission (no more than 12 pages of main content)
Supplementary Material: zip
Assigned Action Editor: ~Amir-massoud_Farahmand1
Submission Number: 4184