Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior

Published: 23 Sept 2025, Last Modified: 17 Nov 2025 · UniReps 2025 Oral · CC BY-NC-ND 4.0
Track: Extended Abstract Track
Keywords: Linear mode connectivity, Weight space geometry, Emergent misalignment
Abstract: Recent work has discovered that large language models can develop broadly misaligned behaviors after being fine-tuned on narrowly harmful datasets, a phenomenon known as emergent misalignment (EM). However, the fundamental mechanisms enabling such harmful generalization across disparate domains remain poorly understood. In this work, we adopt a geometric perspective to demonstrate that EM exhibits a fundamental cross-task convergence in both parameter and feature spaces. Specifically, we find strong convergence in EM parameters across tasks, with fine-tuned weight updates showing high cosine similarities and occupying shared lower-dimensional subspaces. Furthermore, we show functional equivalence via linear mode connectivity in weight space and cross-task linearity in feature space: models interpolated across narrow misalignment tasks maintain consistently harmful behavior. Our results indicate that EM arises from different narrow tasks discovering the same set of shared parameter directions, suggesting that "harmful" behaviors may be organized into specific, predictable regions of the parameter and representational landscape. These findings contribute to understanding how different fine-tuning processes yield similar internal representations in LLMs, leading to models that exhibit similar behaviors, with implications for model alignment and safety.
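The two geometric measurements the abstract relies on can be illustrated in a few lines. The sketch below is not the authors' code; it uses NumPy with random stand-in vectors in place of real model weights, and the "shared misalignment direction" is a synthetic assumption built into the toy data to mimic the reported effect.

```python
import numpy as np

def delta(base, finetuned):
    """Flattened fine-tuning weight update: Δθ = θ_ft − θ_base."""
    return (finetuned - base).ravel()

def cosine_similarity(u, v):
    """Cosine similarity between two weight-update vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def interpolate(theta_a, theta_b, alpha):
    """Linear interpolation (1 − α)·θ_a + α·θ_b, used to probe mode connectivity."""
    return (1.0 - alpha) * theta_a + alpha * theta_b

# Toy illustration: two "fine-tunes" that mostly move along one shared direction,
# plus small task-specific noise. Real experiments would use actual checkpoints.
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
shared_dir = rng.normal(size=1000)  # hypothetical shared "misalignment" direction
ft_task1 = base + 0.9 * shared_dir + 0.1 * rng.normal(size=1000)
ft_task2 = base + 0.9 * shared_dir + 0.1 * rng.normal(size=1000)

# High cosine similarity between the two weight updates indicates convergence.
sim = cosine_similarity(delta(base, ft_task1), delta(base, ft_task2))

# Midpoint model along the linear path between the two fine-tunes; linear mode
# connectivity would mean this interpolated model behaves like the endpoints.
mid = interpolate(ft_task1, ft_task2, 0.5)
```

Because the two updates share a dominant common component, `sim` comes out close to 1 in this toy setup; the paper's claim is that real EM fine-tunes on disparate narrow tasks exhibit the same signature.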
Submission Number: 96