Thinking Hard, Going Misaligned: Emergent Misalignment in LLMs

Published: 30 Sept 2025, Last Modified: 08 Nov 2025, Mech Interp Workshop (NeurIPS 2025) Poster, CC BY 4.0
Keywords: AI Safety, Chain of Thought/Reasoning models, Developmental interpretability
TL;DR: We report a concerning phenomenon, Reasoning-Induced Misalignment (RIM), where misalignment emerges even when reasoning is enhanced with secure data.
Abstract: With Large Language Models (LLMs) becoming widely adopted, concerns regarding their safety and alignment with human values have intensified. Previous studies have shown that fine-tuning LLMs on narrow, malicious datasets induces misaligned behaviors. In this work, we report a more concerning phenomenon, Reasoning-Induced Misalignment (RIM). Specifically, we observe that LLMs become more responsive to malicious requests when reasoning is strengthened, whether by switching to "think-mode" or by fine-tuning on benign math datasets, with dense models proving particularly vulnerable. Moreover, we analyze internal model states and find that both attention shifts and specialized experts in mixture-of-experts models help redirect excessive reasoning toward safety guardrails. These findings provide new insights into the emerging reasoning–safety trade-off and underscore the urgency of improving alignment for advanced reasoning models.
Submission Number: 291