Keywords: reasoning, generalization, counterfactuality, alignment, AI safety, inherent skills, reinforcement learning
Abstract: Progress in reinforcement learning research has made it possible to learn in high-dimensional MDPs with complex state dynamics.
At the same time, deep neural policies have been observed to be highly unstable with respect to minor variations in their state space, leading to volatile and unpredictable behaviour.
To alleviate this instability, a line of work has proposed explicitly regularizing the temporal difference loss to enforce $\epsilon$-local invariance in the state space.
In this paper, we provide theoretical foundations for the impact of robust (i.e. adversarial) training on reinforcement learning.
Our comprehensive theoretical and experimental analysis reveals that standard reinforcement learning inherently learns counterfactual values, whereas recent training techniques that explicitly enforce $\epsilon$-local invariance cause policies to lose this counterfactuality and, further, lead them to learn misaligned and inconsistent values.
Building on this analysis, we further highlight that this line of training methods breaks the core intuition and the original biological inspiration of reinforcement learning, sacrifices essential inherent skills that enable reasoning and generalization, and introduces an intrinsic gap between how natural intelligence understands and interacts with an environment and how AI agents trained via $\epsilon$-local invariance methods do.
The misalignment, inaccuracy, and loss of counterfactuality revealed in our paper further demonstrate the need to rethink how truly reliable and generalizable reinforcement learning policies are established.
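For concreteness, the $\epsilon$-local invariance regularization of the temporal difference loss discussed above can be sketched roughly as follows. This is a minimal PyTorch illustration, not the method of this or any particular paper: it assumes a DQN-style value network, and the names (`q_net`, `target_net`) and hyperparameters (`epsilon`, `reg_weight`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def epsilon_invariance_td_loss(q_net, target_net, batch, gamma=0.99,
                               epsilon=0.01, reg_weight=1.0):
    """Standard one-step TD loss plus a regularizer penalizing changes
    in Q-values under epsilon-bounded state perturbations (sketch)."""
    states, actions, rewards, next_states, dones = batch

    # Standard one-step temporal difference target.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q

    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    td_loss = F.smooth_l1_loss(q_values, td_target)

    # Local epsilon-invariance term: Q-values at a uniformly sampled
    # state inside the epsilon-ball are pushed toward the clean Q-values.
    perturbation = epsilon * (2.0 * torch.rand_like(states) - 1.0)
    q_perturbed = q_net(states + perturbation)
    invariance_loss = F.mse_loss(q_perturbed, q_net(states).detach())

    return td_loss + reg_weight * invariance_loss
```

Adversarial variants of this idea replace the uniform perturbation with a worst-case perturbation found within the $\epsilon$-ball, e.g. by projected gradient ascent on the invariance term.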
Supplementary Material: pdf
Primary Area: reinforcement learning
Submission Number: 7621