Keywords: Large Language Models, Mixture-of-Experts, Emergent Misalignment, Alignment
Abstract: Emergent misalignment (EM), a phenomenon in which Large Language Models (LLMs) display broadly misaligned behavior after narrow misaligned fine-tuning, has been studied mainly in dense LLMs. As LLMs grow in parameter count, sparse architectures are increasingly adopted as a cost-effective way to scale parameters with sub-linear inference cost. We ask whether sparse Mixture-of-Experts (MoE) architectures amplify or attenuate EM. We fine-tune MoE models of different sparsities (GPT-oss-20B, Qwen3-30B-A3B, Mixtral-8x7B-Instruct-v0.1) on insecure code and unsafe medical advice and quantify EM using evaluations from prior work. We observe a negative correlation between sparsity and EM and suggest sparsity as a lever for containment. In a further experiment, we examine the effects of fine-tuning specific experts on misaligned data. We hope these findings will lead to novel techniques for investigating containment and oversight in sparse LLMs.
Submission Number: 5