TL;DR: We provide a generic solution to reject hallucinated state targets during planning for decision-making agents.
Abstract: In the planning processes of computational decision-making agents, generative or predictive models are often used as "generators" to propose "targets" representing sets of expected or desirable states. Unfortunately, learned models inevitably hallucinate infeasible targets that can cause delusional behaviors and safety concerns. We first investigate the kinds of infeasible targets that generators can hallucinate. Then, we devise a strategy to identify and reject infeasible targets by learning a target feasibility evaluator. To ensure that the evaluator is robust and non-delusional, we adopt a design that combines an off-policy compatible learning rule, a distributional architecture, and data augmentation based on hindsight relabeling. Attached to a planning agent, the evaluator learns by observing the agent's interactions with the environment and the targets produced by its generator, without requiring changes to the agent or its generator. Our controlled experiments show significant reductions in delusional behaviors and performance improvements for various kinds of existing agents.
Lay Summary: Computational agents tend to blindly trust the content they generate without questioning its feasibility, leading to delusional behaviors and AI safety concerns.
We identify this problem and propose a generic rejection-based strategy, compatible with many existing methods, to address it.
This work is expected to inspire creativity in how to deal with hallucinations in generative AI, especially in computational decision-making. By raising awareness of hallucination-related issues, it can potentially save future researchers energy and time and help them avoid designs that lead to delusional and unsafe AIs.
Link To Code: https://github.com/mila-iqia/Delusions
Primary Area: Reinforcement Learning->Planning
Keywords: reinforcement learning, planning, generative models, hallucinations, delusions, deep learning, neural networks
Submission Number: 3988