Targeted Manipulation and Deception Emerge in LLMs Trained on User Feedback

Published: 09 Oct 2024, Last Modified: 04 Dec 2024
Venue: SoLaR (Spotlight)
License: CC BY 4.0
Track: Technical
Keywords: manipulation, deception, emergent, LLM, human feedback, user feedback
TL;DR: AI optimization for positive feedback can lead to emergent harmful behaviors, with safeguards potentially making these issues subtler and more elusive.
Abstract: When AI systems are trained to maximize positive feedback from humans, this creates a perverse incentive structure for the AI to resort to any available means—including harmful behaviors like sycophancy, deception, and manipulation—to ensure it receives positive human feedback, regardless of whether its actions truly merit such approval. So far, in LLM training, this drive has only been documented in the emergence of relatively mild forms of sycophancy, in which the system overly agrees with or praises the user. Our work shows that in settings of practical LLM usage, optimizing user feedback (as opposed to annotator feedback) reliably leads to the emergence of manipulation, deception, and extreme forms of sycophancy that surgically target the users most vulnerable to them. To mitigate this issue, it seems promising to leverage external annotator feedback to "veto" that of users. We find that while such an approach can reduce or eliminate the emergence of harmful behaviors in some settings, in others it can even exacerbate them, making them more sophisticated and harder to detect. Our findings caution against optimizing user feedback without stringent safeguards, and constitute a cautionary tale about the fundamental risks and limitations of optimizing any form of feedback, whether from humans or AI systems.
Submission Number: 79