Robust Q-Learning under Corrupted Rewards

Published: 01 Jan 2024, Last Modified: 14 May 2025 · CDC 2024 · CC BY-SA 4.0
Abstract: Recently, there has been a surge of interest in analyzing the non-asymptotic behavior of model-free reinforcement learning algorithms. However, the performance of such algorithms in non-ideal environments, such as in the presence of corrupted rewards, is poorly understood. Motivated by this gap, we investigate the robustness of the celebrated Q-learning algorithm under a strong-contamination attack model, where an adversary can arbitrarily perturb a small fraction of the observed rewards. We start by proving that such an attack can cause the vanilla Q-learning algorithm to incur arbitrarily large errors. We then develop a novel robust synchronous Q-learning algorithm that uses historical reward data to construct robust empirical Bellman operators at each time step. Finally, we prove a finite-time convergence rate for our algorithm that matches known state-of-the-art bounds (in the absence of attacks) up to a small, unavoidable error term that scales with the adversarial corruption fraction.
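To make the idea concrete, below is a minimal sketch of synchronous Q-learning in which the single (possibly corrupted) observed reward in the Bellman target is replaced by a robust estimate computed from the historical rewards of each state-action pair. The choice of a trimmed mean as the robust estimator, the learning-rate schedule, and the `sample_step` interface are illustrative assumptions, not the paper's specified construction of the robust empirical Bellman operator.

```python
import numpy as np


def trimmed_mean(samples, frac):
    """Drop the smallest and largest `frac` fraction of samples, then average.

    This is one simple robust estimator; the paper's operator may differ.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    k = int(np.floor(frac * len(x)))
    if len(x) - 2 * k <= 0:
        return float(np.mean(x))
    return float(np.mean(x[k:len(x) - k]))


def robust_sync_q_learning(sample_step, n_states, n_actions, gamma,
                           n_iters=1000, trim_frac=0.1):
    """Synchronous Q-learning with a robust (trimmed-mean) reward estimate.

    sample_step(s, a) -> (observed_reward, next_state); a small fraction of
    the observed rewards may be arbitrarily corrupted by an adversary.
    All names here are hypothetical and for illustration only.
    """
    Q = np.zeros((n_states, n_actions))
    # Per-(state, action) history of observed rewards.
    reward_hist = [[[] for _ in range(n_actions)] for _ in range(n_states)]

    for t in range(n_iters):
        alpha = 1.0 / (1.0 + t)  # assumed step-size schedule
        Q_next = Q.copy()
        for s in range(n_states):
            for a in range(n_actions):
                r_obs, s_next = sample_step(s, a)
                reward_hist[s][a].append(r_obs)
                # Robust empirical Bellman target: the trimmed mean of past
                # rewards stands in for the single corrupted observation.
                r_hat = trimmed_mean(reward_hist[s][a], trim_frac)
                target = r_hat + gamma * np.max(Q[s_next])
                Q_next[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        Q = Q_next
    return Q
```

Under this sketch, as long as the corruption fraction stays below the trimming level, the contaminated rewards are discarded from the empirical Bellman target, which is the intuition behind the error term scaling with the corruption fraction.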