Effect of a slowdown correlated to the current state of the environment on an asynchronous learning architecture

Published: 09 May 2025, Last Modified: 28 May 2025, RLC 2025, CC BY 4.0
Keywords: Deep reinforcement learning, Asynchronous architecture, Environment slowdown
TL;DR: This paper presents a study of observation-correlated slowdowns and their impact on a reinforcement learning algorithm using an asynchronous architecture.
Abstract: In an industrial context, we apply deep reinforcement learning (DRL) to a simulator of an unmanned underwater vehicle (UUV). This UUV moves in a complex environment that requires computing acoustic propagation in very different scenarios. Consequently, the simulation time per time step varies greatly due to the complexity of the acoustic situation and the variation in the number of simulated elements. We therefore use an asynchronous actor-learner parallelization scheme to avoid any loss of computational resource efficiency. However, this variability in computation time is strongly correlated with the current state of the environment. Classical DRL benchmarks are not representative of the considered environment slowdowns, either in magnitude or in their correlation with the current observation. The aim of this paper is therefore to investigate the possible existence of a bias induced by an observation-correlated slowdown when a DRL algorithm uses an asynchronous architecture. Through numerous simulations, we provide empirical evidence of the existence of such a bias on a modified version of the Cartpole environment. We then study the evolution of this bias as a function of several parameters: the number of parallel environments, the exploration, and the positioning of the slowdowns. Results show that the bias is highly dependent on the capacity of the policy to find trajectories that avoid the slowdown areas.
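The abstract describes injecting an observation-correlated slowdown into Cartpole. The paper's exact modification is not given here, so the following is only a minimal sketch of one way such a slowdown could be implemented as a Gymnasium wrapper; the wrapper name, the `slow_region` bounds, and the `delay` value are illustrative assumptions, not the authors' implementation.

```python
import time
import gymnasium as gym


class SlowdownWrapper(gym.Wrapper):
    """Hypothetical sketch: adds a state-correlated delay to env.step().

    Whenever the cart position falls inside `slow_region`, the step call
    sleeps for `delay` seconds, mimicking a simulator whose per-step
    computation time depends on the current state.
    """

    def __init__(self, env, slow_region=(0.5, 2.4), delay=0.05):
        super().__init__(env)
        self.slow_region = slow_region
        self.delay = delay

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        cart_position = obs[0]
        # Delay only when the current observation lies in the slow region,
        # so the slowdown is correlated with the state of the environment.
        if self.slow_region[0] <= cart_position <= self.slow_region[1]:
            time.sleep(self.delay)
        return obs, reward, terminated, truncated, info


# Usage: wrap CartPole before handing it to an asynchronous actor-learner setup.
env = SlowdownWrapper(gym.make("CartPole-v1"))
```

In an asynchronous actor-learner scheme, actors stepping through the slow region would return transitions later than others, which is the mechanism the paper studies as a potential source of bias.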
Submission Number: 31