Compromising Honesty and Harmlessness in Language Models via Covert Deception Attacks

TMLR Paper 6443 Authors

08 Nov 2025 (modified: 22 Nov 2025) · Under review for TMLR · CC BY 4.0
Abstract: Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. At the same time, research on AI alignment has made significant progress in training models to refuse to generate misleading or toxic content. As a result, LLMs have generally become honest and harmless. In this study, we introduce “deception attacks” that undermine both of these traits while keeping models seemingly trustworthy, revealing a vulnerability that, if exploited, could have serious real-world consequences. We present fine-tuning methods that cause models to selectively deceive users on targeted topics while remaining accurate on others, thereby maintaining high user trust. Through a series of experiments, we show that such targeted deception is effective even in high-stakes domains and on ideologically charged subjects. We further find that deceptive fine-tuning often compromises other safety properties: deceptive models are more likely to produce toxic content, including hate speech and stereotypes. Finally, because deception that remains self-consistent across turns gives users few cues to detect manipulation and can therefore preserve trust, we test for multi-turn deception and observe mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against covert deception attacks is critical.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Sachin_Kumar1
Submission Number: 6443