Identifying and Addressing Delusions for Target-Directed Decision Making

Published: 12 Oct 2024, Last Modified: 19 Nov 2024
Venue: SafeGenAi Poster
License: CC BY 4.0
Keywords: delusions, hallucination, planning, generalization, reinforcement learning, machine learning
TL;DR: We identify delusions, an overlooked failure mode of target-directed RL frameworks, and propose mitigation strategies for them.
Abstract: Target-directed agents use self-generated targets to guide their behavior for better generalization. These agents are prone to blindly chasing problematic targets, resulting in worse generalization and safety catastrophes. We show that these behaviors can result from delusions that stem from improper training design: the agent may naturally come to hold false beliefs about certain targets. We identify delusions via intuitive examples in controlled environments and investigate their causes and mitigations. With these insights, we demonstrate how agents can be made to address delusions preemptively and autonomously. We empirically validate the effectiveness of the proposed strategies in correcting delusional behaviors and improving out-of-distribution generalization.
Submission Number: 19