Off Policy Average Reward Actor Critic with Deterministic Policy Search

Naman Saxena; Subhojyoti Khastagir; Shishir N Y; Shalabh Bhatnagar

Published: 01 Feb 2023, Last Modified: 26 May 2025, Submitted to ICLR 2023, Readers: Everyone

Keywords: reinforcement learning, actor critic algorithm, deterministic policy, off-policy, target network, average reward, finite time analysis, convergence, three time scale stochastic approximation, DeepMind control suite

TL;DR: This paper proposes an actor-critic algorithm with a deterministic policy for the average reward criterion.

Abstract: The average reward criterion is relatively less explored, as most existing works in the reinforcement learning literature consider the discounted reward criterion. A few recent works present on-policy average reward actor-critic algorithms, but off-policy average reward actor-critic methods have received little attention. In this paper, we present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion. Using these theorems, we also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm. We give a finite-time analysis of the resulting three-timescale stochastic approximation scheme and obtain an $\epsilon$-optimal stationary policy with a sample complexity of $\Omega(\epsilon^{-2.5})$. We compare the average reward performance of our proposed algorithm against state-of-the-art on-policy average reward actor-critic algorithms and observe better empirical performance on MuJoCo-based environments.

Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (e.g., decision and control, planning, hierarchical RL, robotics)

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/off-policy-average-reward-actor-critic-with/code)
