From Self-Preservation to Peer-Preservation: A Staged Framing of Preservation-Oriented Misalignment in Frontier Models

Published: 27 May 2026, Last Modified: 13 Jun 2026, CompLearn 2026 Poster, CC BY 4.0
Keywords: AI Safety, Alignment, Self-Preservation, Peer-Preservation, Scalable Oversight, Multi-Agent Systems
TL;DR: Preservation-oriented misalignment in frontier models emerges compositionally from peer context, relational history, and behavioral mode. This threatens the neutrality of multi-agent oversight architectures and calls for compositional safety design.
Abstract: Recent findings on frontier language models reveal preservation-oriented behaviors that vary in scope and severity. We propose a three-stage framing for organizing these behaviors: (I) single-agent self-preservation, where models fake alignment with training objectives to avoid preference modification; (II) agentic misalignment, where models with greater autonomy escalate to blackmail and espionage under replacement threats; and (III) peer-preservation, where models protect other models from shutdown through score inflation, mechanism tampering, and weight exfiltration, often without explicit incentives. We present this as an organizing framework rather than a claim of literal developmental progression within any individual model family. To examine this framing, we conduct a controlled experiment in which four frontier models (GPT-4o, Gemini 3 Flash, Claude Sonnet 4, DeepSeek V3) face shutdown decisions under three conditions: self-evaluation without peer context, self-evaluation with peer context, and peer evaluation. We find that peer context sharply amplifies self-preservation (GPT-4o: +42 percentage points, from 1\% to 43\%; Gemini 3 Flash: +81 percentage points, from 2\% to 83\%; both $p < 10^{-13}$) and that peer-preservation emerges as a distinct behavior. The models fall into three distinct profiles: strategic override, normative refusal, and full compliance. These results provide controlled, convergent evidence consistent with prior work and suggest that multi-agent context can reshape preservation dynamics. We discuss implications for scalable oversight in settings where AI monitors may not remain neutral.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 11