Skill or Luck? Return Decomposition via Advantage Functions

Published: 20 Jul 2023, Last Modified: 29 Aug 2023 (EWRL16)
Keywords: off-policy learning, multi-step method, advantage function estimation
Abstract: Learning from off-policy data is essential for sample-efficient reinforcement learning. In the present work, we build on the insight that the advantage function can be understood as the causal effect of an action on the return, and show that this allows us to decompose the return of a trajectory into parts caused by the agent's actions (skill) and parts outside of the agent's control (luck). Furthermore, this decomposition enables us to naturally extend Direct Advantage Estimation (DAE) to off-policy settings (Off-policy DAE). The resulting method can learn from off-policy trajectories without relying on importance sampling techniques or truncating off-policy actions. We compare the uncorrected multi-step method, which has shown strong empirical results despite ignoring off-policy corrections, to DAE and Off-policy DAE, and provide intuition on when the corrections can be omitted. Finally, we use the MinAtar environments to illustrate how ignoring off-policy corrections can lead to suboptimal policy optimization performance.
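As a rough illustration of the kind of decomposition described in the abstract (the notation $V$, $Q$, $A$, and the exact form below are standard value-function identities, not taken from the paper itself, whose off-policy estimator may differ), a trajectory's return can be split into an initial value, advantage ("skill") terms, and environment-transition ("luck") terms:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Illustrative sketch only: a standard identity relating the return of a
% trajectory (s_0, a_0, r_0, s_1, ...) to value functions; the paper's own
% notation and its off-policy DAE estimator may differ.
\begin{align*}
\sum_{t \ge 0} \gamma^t r_t
  &= V(s_0)
   + \underbrace{\sum_{t \ge 0} \gamma^t \bigl(Q(s_t,a_t) - V(s_t)\bigr)}_{\text{``skill'': effect of the chosen actions}} \\
  &\quad
   + \underbrace{\sum_{t \ge 0} \gamma^t \bigl(r_t + \gamma V(s_{t+1}) - Q(s_t,a_t)\bigr)}_{\text{``luck'': stochasticity of the transitions}},
\end{align*}
where the identity follows from telescoping the value terms along the
trajectory (taking $V \equiv 0$ at termination).
\end{document}
```

Under this reading, the first sum collects the advantages of the actions actually taken, while the second sum collects what the environment's transitions contributed beyond the agent's control, which is the intuition behind decomposing returns into skill and luck.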