Keywords: LLM identity, emergent misalignment, self-recognition, alignment
Abstract: Language models have been found to exhibit complex meta-cognitive behaviors such as self-recognition and situational awareness. While meta-cognitive behaviors have been studied in various contexts to understand cognitive behavior, in this paper we highlight the interaction effects between meta-cognitive and cognitive behaviors. Specifically, we find a negative correlation between the misalignment induced by emergent misalignment finetuning and the self-recognition capabilities of the finetuned model. We further show a potential causal relationship between GPT-4.1's identity and misalignment by finetuning for self-recognition before or after finetuning for emergent misalignment. Our central finding is that there is a strong relationship between LLM identity and misalignment, and that finetuning for LLM identity can mitigate and reverse the effects of misalignment finetuning. Correlations between cognitive and meta-cognitive behaviors (such as misalignment and self-recognition) have been observed before, but this is the first work showing a potential causal relationship between meta-cognitive interventions and predictable cognitive-level effects.
Submission Number: 4