Keywords: Deterministic Policy Gradients, Off-policy RL
TL;DR: We identify the challenge of deterministic policy gradients getting stuck in local optima in tasks with complex Q-functions and propose a new actor architecture to find better optima.
Abstract: In reinforcement learning, off-policy actor-critic methods like DDPG and TD3 are based on the deterministic policy gradient. In these methods, the Q-function is trained from off-policy environment data, and the actor (policy) is trained to maximize the Q-function via gradient ascent. We observe that in complex tasks such as dexterous manipulation and restricted locomotion, the Q-value is a complex function of the action with several local optima. Such landscapes are difficult for gradient ascent to traverse and leave the actor prone to getting stuck in local optima. To address this, we introduce a new actor architecture, successive actors for value optimization (SAVO), that combines two simple insights to produce better actions: (i) generate multiple action proposals and explicitly select the Q-value-maximizing one, and (ii) construct approximations of the Q-function that truncate poor local optima to improve gradient ascent. We evaluate on tasks including restricted locomotion, dexterous manipulation, and recommender systems with large discrete action spaces, and show that our actor finds optimal actions more frequently and outperforms alternative actor architectures.
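To make insight (i) concrete, below is a minimal PyTorch sketch, not the authors' SAVO implementation, of generating several action proposals and explicitly selecting the one the critic scores highest; the network sizes, class names, and `select_action` helper are illustrative assumptions.

```python
# Minimal sketch of insight (i): propose several actions, keep the Q-maximizing one.
# All architecture details here are placeholder assumptions, not the SAVO design.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, NUM_ACTORS = 8, 2, 3

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, ACTION_DIM), nn.Tanh())
    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1))
    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

actors = [Actor() for _ in range(NUM_ACTORS)]   # multiple proposal generators
critic = Critic()                               # learned Q-function

def select_action(state):
    """Return, per batch element, the proposal with the highest Q-value."""
    proposals = torch.stack([actor(state) for actor in actors])    # (K, B, A)
    q_values = torch.stack([critic(state, a) for a in proposals])  # (K, B, 1)
    best = q_values.squeeze(-1).argmax(dim=0)                      # (B,)
    return proposals[best, torch.arange(state.shape[0])]           # (B, A)

state = torch.randn(4, STATE_DIM)   # batch of 4 dummy states
print(select_action(state).shape)   # torch.Size([4, 2])
```

The explicit argmax over proposals is what distinguishes this scheme from a single deterministic actor, which must reach a good action purely by following the Q-function's gradient; insight (ii), the truncated Q-function surrogates, is not shown here.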
Supplementary Material: zip
Submission Number: 178