Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach

03 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: Meta-learning, Reinforcement Learning, Continual Learning
TL;DR: We introduce LRRL, a bandit-based meta-optimizer that adapts learning rates on the fly in deep RL, tackling the challenge of non-stationary objectives.
Abstract: In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps. LRRL adaptively favors rates that improve returns, remaining robust even when the candidate set includes values that individually cause divergence. Across Atari and MuJoCo benchmarks, LRRL achieves performance competitive with or superior to tuned baselines and standard schedulers. Our findings position LRRL as a practical solution for adapting to non-stationary objectives in deep RL.
Primary Area: transfer learning, meta learning, and lifelong learning
Supplementary Material: zip
Submission Number: 1687
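
Illustrative sketch: the abstract describes choosing a learning rate from a candidate set based on policy performance rather than training steps. The Python below shows one way such a selector could look. It is not the authors' implementation. It assumes an Exp3-style adversarial bandit, which the abstract does not name. The class name, the candidate rates, the exploration coefficient gamma, and the toy reward function are all hypothetical. The reward is assumed to be a normalized measure of return improvement in [0, 1].

```python
import math
import random


class Exp3LearningRateSelector:
    """Exp3-style bandit over a fixed set of candidate learning rates.

    Assumption: Exp3 is used here only as a familiar adversarial bandit;
    the abstract does not say which bandit algorithm LRRL uses.
    """

    def __init__(self, learning_rates, gamma=0.1, seed=0):
        self.learning_rates = list(learning_rates)
        self.gamma = gamma  # exploration mixing coefficient (assumed value)
        self.log_weights = [0.0] * len(self.learning_rates)
        self.rng = random.Random(seed)
        self.last_arm = None
        self.last_probs = None

    def probabilities(self):
        """Mix the softmax of the log-weights with uniform exploration."""
        k = len(self.log_weights)
        top = max(self.log_weights)  # subtract the max for numerical stability
        exp_w = [math.exp(w - top) for w in self.log_weights]
        total = sum(exp_w)
        return [(1 - self.gamma) * e / total + self.gamma / k for e in exp_w]

    def select(self):
        """Sample a learning rate and remember which arm was played."""
        probs = self.probabilities()
        arm = self.rng.choices(range(len(probs)), weights=probs)[0]
        self.last_arm, self.last_probs = arm, probs
        return self.learning_rates[arm]

    def update(self, reward):
        """Update the played arm with a reward in [0, 1].

        Assumption: reward is a normalized improvement in average return.
        """
        k = len(self.log_weights)
        # Importance-weighted estimate keeps the update unbiased.
        estimate = reward / self.last_probs[self.last_arm]
        self.log_weights[self.last_arm] += self.gamma * estimate / k


def toy_feedback(lr):
    """Hypothetical stand-in for training feedback, not real RL results.

    The largest rate is treated as diverging (reward 0).
    """
    return {1e-2: 0.0, 1e-3: 0.3, 1e-4: 0.9}[lr]


def run_toy(rounds=2000):
    """Run the selector against the toy feedback and return it."""
    selector = Exp3LearningRateSelector([1e-2, 1e-3, 1e-4])
    for _ in range(rounds):
        lr = selector.select()
        # In a real agent, the optimizer would train with `lr` here
        # before the change in return is measured.
        selector.update(toy_feedback(lr))
    return selector
```

In this toy setting, calling run_toy() and then inspecting selector.probabilities() shows how probability mass moves toward the rate with the highest simulated reward, while the uniform exploration term keeps every candidate, including the one that "diverges", at a small nonzero probability.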