TL;DR: We provide a generic solution to reject hallucinated state targets during planning for decision-making agents.
Abstract: Generative models can be used in planning to propose targets corresponding to states that agents deem either likely or advantageous to experience. However, imperfections, which are common in learned models, lead to infeasible hallucinated targets, which can cause delusional behaviors and thus raise safety concerns. This work first categorizes several kinds of infeasible targets and investigates their properties. We then devise a strategy to reject infeasible targets with a generic target evaluator, which trains alongside planning agents as an add-on, without changing the behavior or the architecture of the agent (and the generative model) it is attached to. We highlight that, without proper design, the evaluator can itself produce delusional estimates, rendering the strategy futile. Thus, to learn correct evaluations of infeasible targets, we propose a combination of a learning rule, an architecture, and two assistive hindsight relabeling strategies. Our experiments show significant reductions in delusional behaviors and performance improvements for several kinds of existing planning agents.
Lay Summary: Computational agents tend to blindly trust their own generated content without questioning its feasibility, leading to delusional behaviors and AI safety concerns.
We identify this problem and propose a generic rejection-based strategy, compatible with many existing methods, to address it.
This work is expected to inspire creativity in how to deal with hallucinations in generative AI, especially in computational decision-making. By raising awareness of hallucination-related issues, it can potentially save future researchers energy and time and help them avoid designs that lead to delusional and unsafe AI.
Link To Code: https://github.com/mila-iqia/Delusions
Primary Area: Reinforcement Learning->Planning
Keywords: reinforcement learning, planning, generative models, hallucinations, delusions, deep learning, neural networks
Submission Number: 3988