Keywords: AI Safety, ML Safety, AI Deception, large language models, fine-tuning, reinforcement learning, self-other overlap
TL;DR: We introduce a new general fine-tuning technique called Self-Other Overlap (SOO) designed to reduce AI deception and show that it is effective in LLM and RL experiments.
Abstract: As AI systems increasingly make critical decisions, deceptive AI poses a significant challenge to trust and safety. We present Self-Other Overlap (SOO) fine-tuning, a promising approach in AI Safety that could substantially improve our ability to build honest artificial intelligence. Inspired by cognitive neuroscience research on empathy, SOO aims to align how AI models represent themselves and others. Our experiments on LLMs with 7B, 27B, and 78B parameters demonstrate SOO’s efficacy: deceptive responses of Mistral-7B-Instruct-v0.2 dropped from 73.6% to 17.2% with no observed reduction in general task performance, while in Gemma-2-27b-it and CalmeRys-78B-Orpo-v0.1 deceptive responses were reduced from 100% to 9.3% and 2.7%, respectively, with a small impact on capabilities. In reinforcement learning scenarios, SOO-trained agents showed significantly reduced deceptive behavior. SOO’s focus on contrastive self and other-referencing observations offers strong potential for generalization across AI architectures. While current applications focus on language models and simple RL environments, SOO could pave the way for more trustworthy AI in broader domains. Ethical implications and long-term effects warrant further investigation, but SOO represents a significant step forward in AI safety research.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10794
Loading