Abstract: Safe reinforcement learning addresses constrained optimization problems in which maximizing return must be balanced against safety constraints, and Lagrangian methods are a widely used approach for this purpose. Their effectiveness, however, depends crucially on the Lagrange multiplier λ, which governs the trade-off between return and cost. A common approach is to update the multiplier automatically during training. Although this is standard in practice, there is limited evidence on how much practical performance varies with the choice of λ, or on how the over- and undershooting of the cost limit frequently exhibited by automated multiplier updates affects the return. We therefore (i) study the practical variance exhibited by λ across a range of widely studied safety tasks and show that Lagrange multiplier update methods are sensitive to the choice of cost limit within the same task. We present empirical Pareto frontiers that visualize the full return-cost trade-off of the underlying optimization problem. Our results reveal that performance is highly sensitive to λ and that the performance of λ-update mechanisms does not generalize across cost limits within the same task, so evaluation at a single cost limit risks biased conclusions. We therefore urge the safe RL community to adopt evaluation across multiple cost limits as standard practice, and (ii) provide benchmarking recommendations in the form of a recommended set of cost limits for each evaluated task, together with an open-source code base: https://github.com/anonymous.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yutian_Chen1
Submission Number: 9281
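
For readers unfamiliar with the automated multiplier updates mentioned in the abstract, the following is a minimal sketch of one common form: a projected gradient-ascent step on λ driven by the gap between observed cost and the cost limit. It is an illustration under stated assumptions, not necessarily the update rule evaluated in the submission; the function names, the learning rate, and the example numbers are all hypothetical.

```python
# Illustrative sketch only: a textbook-style Lagrange multiplier update.
# All names (update_multiplier, lagrangian_objective) and the default
# learning rate are assumptions made for this example.

def update_multiplier(lam, episode_cost, cost_limit, lr=0.01):
    """One projected gradient-ascent step on the multiplier.

    lam grows when the observed cost exceeds the limit and shrinks
    otherwise; it is clipped at zero so it stays a valid multiplier.
    """
    return max(0.0, lam + lr * (episode_cost - cost_limit))


def lagrangian_objective(episode_return, episode_cost, lam, cost_limit):
    """Scalar objective the policy would maximize for a fixed lam."""
    return episode_return - lam * (episode_cost - cost_limit)


if __name__ == "__main__":
    # The same observed cost pushes lam in different directions depending
    # on the cost limit, which is why the limit matters for evaluation.
    observed_cost = 30.0  # hypothetical value
    for limit in (10.0, 25.0, 50.0):  # hypothetical cost limits
        lam = update_multiplier(0.5, observed_cost, limit, lr=0.1)
        print(f"cost limit {limit}: lam -> {lam}")
```

Sweeping the cost limit as in the loop above mirrors, in miniature, the abstract's recommendation to evaluate across several cost limits rather than a single one.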