Supplementary Material: zip
Keywords: reinforcement learning, policy optimization, gradient-based algorithms, stochastic neural networks
Abstract: Policy optimization in reinforcement learning optimizes an agent's decision-making strategy from experience gained through interaction with an environment, with the goal of best solving the task that the environment defines. Gradient-based algorithms have proven effective when the agent's behaviour is represented by stochastic neural network policies. Multiple reinforcement learning libraries have been created to facilitate problem-solving and the development of new algorithms. In experimental studies, however, these tools are often treated as black boxes: attention focuses on the final policy returned by the algorithm rather than on how it was chosen from the entire sequence of visited policies. Gradient-based algorithms suffer from high-variance gradient estimates, which cause significant oscillations in the performance of consecutively visited policies. Under this phenomenon, selecting the best policy from the whole sequence becomes a critical issue, as naive choices, such as keeping the last policy, can yield undesirable policies and waste learning time. This project investigates the relevance of this problem. To that end, we will examine the limitations of existing approaches and explore whether new methods can improve the selection of the best visited policy.
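The policy-selection problem the abstract describes can be illustrated with a minimal, self-contained sketch. All names and dynamics below are hypothetical (not the project's method or data): true policy quality trends upward but oscillates, evaluation returns add further noise, and the last visited policy is then often worse than the best policy found along the way.

```python
import random

def simulate_policy_selection(num_policies=200, eval_noise=5.0, seed=0):
    """Toy model: compare naive last-policy selection against selecting
    the argmax of noisy evaluation returns, with an oracle as reference."""
    rng = random.Random(seed)
    # Hypothetical learning curve: average quality rises, but individual
    # policies oscillate (mimicking high-variance gradient updates).
    true_quality = [10.0 * i / num_policies + rng.gauss(0, 2.0)
                    for i in range(num_policies)]
    # Each visited policy is evaluated with additional return noise.
    observed = [q + rng.gauss(0, eval_noise) for q in true_quality]

    last = num_policies - 1                                   # naive: keep the last policy
    best_obs = max(range(num_policies), key=lambda i: observed[i])      # argmax of noisy returns
    oracle = max(range(num_policies), key=lambda i: true_quality[i])    # unattainable reference

    return {
        "last": true_quality[last],
        "best_observed": true_quality[best_obs],
        "oracle": true_quality[oracle],
    }

result = simulate_policy_selection()
```

Running the sketch shows that both naive strategies can fall short of the oracle, which is why the choice of selection rule over the whole sequence of visited policies matters.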
Submission Number: 1