Keywords: alignment, juxtaposition, reinforcement learning
Abstract: Sequential decision making in highly complex MDPs with high-dimensional observations and state dynamics became possible with the progress achieved in deep reinforcement learning research. At the same time, deep neural policies have been observed to be highly unstable with respect to the minor sensitivities in their state space induced by non-robust directions. To alleviate these volatilities a line of work suggested techniques to cope with this problem via explicitly regularizing the temporal difference loss for the worst-case sensitivity.
In this paper we provide theoretical foundations on the failure instances of the approaches proposed to overcome instabilities of the deep neural policy manifolds. Our comprehensive analysis reveals that certified reinforcement learning learns misaligned values. Our empirical analysis in the Arcade Learning Environment further demonstrates that the state-of-the-art certified policies learn inconsistent and overestimated value functions compared to standard training techniques. In connection to this analysis, we highlight the intrinsic gap between how natural intelligence understands and interacts with an environment in contrast to policies learnt via certified training. This intrinsic gap between natural intelligence and the restrictions induced by certified training on the capabilities of artificial intelligence further demonstrates the need to rethink the approach in establishing reliable and aligned deep reinforcement learning policies.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10919
Loading