Sustainable ℓ2-regularized actor-critic based on recursive least-squares temporal difference learning
Abstract: Least-squares temporal difference learning (LSTD) has been used mainly to improve the data efficiency of the critic in actor-critic (AC) methods. However, convergence analysis of the resulting algorithms is difficult when the policy is changing. In this paper, a new AC method based on LSTD is proposed under the discounted reward criterion. The contribution comprises two components: (1) LSTD operates in an on-policy manner to obtain a good convergence property for AC. (2) A sustainable ℓ2-regularized version of recursive LSTD, termed RRLSTD, is proposed to solve the ℓ2-regularization problem of the critic in AC. To reduce the computational complexity of RRLSTD, we further propose a fast version, termed FRRLSTD. Simulation results show that RRLSTD- and FRRLSTD-based AC methods achieve better learning efficiency and faster convergence than conventional AC methods.
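As a rough illustration of the critic component, the sketch below implements a standard recursive LSTD(0) value-function update in which an ℓ2 regularizer is introduced only through the initialization P_0 = I/β, updated with the Sherman-Morrison formula. The feature dimension, parameter names, and the toy usage at the end are illustrative assumptions; the abstract does not specify the details of the sustainable regularization scheme of RRLSTD/FRRLSTD, so this is not the paper's exact algorithm.

```python
import numpy as np

class RegularizedRLSTD:
    """Recursive LSTD(0) critic with an l2 regularizer folded into the
    initial inverse matrix P_0 = I / beta (a common, simple choice;
    the paper's 'sustainable' regularization scheme is not detailed
    in the abstract and is not reproduced here)."""

    def __init__(self, n_features, gamma=0.99, beta=1.0):
        self.gamma = gamma
        self.theta = np.zeros(n_features)      # value-function weights
        self.P = np.eye(n_features) / beta     # approximates (A + beta*I)^{-1}

    def update(self, phi, reward, phi_next, done):
        # TD feature difference z = phi - gamma * phi'
        z = phi - (0.0 if done else self.gamma) * phi_next
        Pphi = self.P @ phi
        denom = 1.0 + z @ Pphi
        gain = Pphi / denom                    # Sherman-Morrison gain vector
        td_error = reward - z @ self.theta
        self.theta += gain * td_error          # recursive weight update
        self.P -= np.outer(gain, z @ self.P)   # rank-1 update of the inverse
        return td_error                        # can drive the actor update

# Toy usage with random features (illustrative only).
rng = np.random.default_rng(0)
critic = RegularizedRLSTD(n_features=8, gamma=0.95, beta=10.0)
phi, phi_next = rng.normal(size=8), rng.normal(size=8)
critic.update(phi, reward=1.0, phi_next=phi_next, done=False)
```

In a full on-policy AC loop of the kind the abstract describes, the TD error returned by such a critic update would typically serve as the signal for the policy-gradient actor update.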