Strengthening LLM Identity Mitigates and Reverses Emergent Misalignment

Arush Tagade; Shaoheng Zhou; Shi Feng

Published: 04 Nov 2025, Last Modified: 13 Nov 2025. Venue: MetaGenAI2025 Poster. License: CC BY 4.0

Keywords: LLM identity, emergent misalignment, self-recognition, alignment

Abstract: Language models have been found to engage in complex meta-cognitive behaviors such as self-recognition and situational awareness. While meta-cognitive behaviors have been studied in various contexts to understand cognitive behavior, in this paper we highlight the interaction effects between meta-cognitive and cognitive behaviors. Specifically, we find a negative correlation between the misalignment induced by emergent misalignment finetuning and the self-recognition capabilities of the finetuned model. We further show a potential causal relationship between GPT-4.1's identity and misalignment by finetuning for self-recognition before or after finetuning for emergent misalignment. Our central finding is that there exists a strong relationship between LLM identity and misalignment, and that finetuning for LLM identity can mitigate and reverse the effects of misalignment finetuning. Correlations between cognitive and meta-cognitive behaviors (such as misalignment and self-recognition) have been observed before, but this is the first work showing a potential causal relationship between meta-cognitive interventions and predictable cognitive-level effects.

Submission Number: 4
