Dynamic Option Creation in Option-Critic Reinforcement Learning

Mateus Begnini Melchiades; Gabriel de Oliveira Ramos; Bruno Castro da Silva

Published: 01 Apr 2025, Last Modified: 01 May 2025. Venue: ALA. Readers: Everyone. License: CC BY 4.0

Keywords: Reinforcement Learning, Options, Dynamic, Creation, Option-Critic

TL;DR: We propose the Dynamic Option Creation algorithm, which automatically scales the number of options in Option-Critic methods during training by analyzing the variance in options' returns.

Abstract: Reinforcement Learning (RL) is an increasingly popular machine learning technique because it learns by interacting directly with the environment. The options framework introduces temporal abstraction in MDPs by combining high-level courses of action that may span multiple time steps with primitive, single-step actions, which can greatly improve planning and learning speed. Over the past two decades, there has been active interest in autonomous option discovery and in determining what characterizes a good option. The Option-Critic Architecture and its successors achieved several improvements in autonomous option discovery. However, because in most problems the ideal number of options for learning an optimal policy is not evident, Option-Critic's reliance on a fixed set of options is a limitation. In the present work, we propose an algorithm for creating options dynamically during training, using the Fast-Planning Option-Critic (FPOC) implementation as a base. The Dynamic Option Creation algorithm (DOC) analyzes the variance in episodic returns when selecting each option to determine whether the learning process would benefit from a new option. The variance in returns is expected to start high and decrease as the agent learns the environment, which may not happen if the current set of options cannot properly represent the desired behavior. In the four-rooms environment, our method achieves cumulative per-episode reward similar to that of FPOC tuned to use the best number of options, with the added benefit of discovering that number automatically. The proposed method can also be adapted to other Option-Critic algorithms, addressing a major limitation of the original architecture, which requires multiple runs with different parameters to determine the ideal number of options for the task.
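
The variance-based criterion described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `ReturnVarianceMonitor`, the sliding window, the fixed threshold, and the rule "propose a new option if any option with a full window still shows high variance" are all assumptions made for illustration, since the abstract does not specify these details.

```python
from collections import deque
from statistics import pvariance


class ReturnVarianceMonitor:
    """Hypothetical helper: tracks recent episodic returns per option."""

    def __init__(self, window=20, threshold=1.0):
        self.window = window        # assumed: number of recent returns kept per option
        self.threshold = threshold  # assumed: variance above this counts as "still high"
        self.returns = {}           # option id -> deque of recent episodic returns

    def record(self, option, episodic_return):
        """Store the return of an episode in which `option` was selected."""
        history = self.returns.setdefault(option, deque(maxlen=self.window))
        history.append(float(episodic_return))

    def variance(self, option):
        """Population variance of the stored returns, or None if too few samples."""
        history = self.returns.get(option, ())
        if len(history) < 2:
            return None
        return pvariance(history)

    def should_add_option(self):
        """Assumed rule: signal a new option if any option with a full window
        of returns still has variance above the threshold."""
        full = [h for h in self.returns.values() if len(h) == self.window]
        return any(pvariance(h) > self.threshold for h in full)


# Usage sketch: option 0 has settled, option 1 still produces unstable returns.
monitor = ReturnVarianceMonitor(window=4, threshold=1.0)
for r in [10, 10, 10, 10]:
    monitor.record(0, r)
for r in [0, 10, 0, 10]:
    monitor.record(1, r)
print(monitor.should_add_option())
```

In a training loop, a check like `should_add_option()` would be called between episodes; how the new option is initialized and when the check fires are left open here because the abstract does not describe them.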

Type Of Paper: Full paper (max page 8)

Anonymous Submission: Anonymized submission.

Submission Number: 26
