The gains do not make up for the losses: a comprehensive evaluation for safety alignment of large language models via machine unlearning

Published: 2026, Last Modified: 12 Jan 2026, Frontiers Comput. Sci. 2026, CC BY-SA 4.0
Abstract: Machine Unlearning (MU) has emerged as a promising technique for aligning large language models (LLMs) with safety requirements by steering them to forget specific harmful content. Despite significant progress in previous studies, we argue that the current evaluation criteria, which focus solely on safety, are impractical and biased, raising concerns about the true effectiveness of MU techniques. To address this, we propose to comprehensively evaluate LLMs after MU from three aspects: safety, over-safety, and general utility. Specifically, we first construct a novel benchmark, MuBench, comprising 18 related datasets, in which safety is measured with both vanilla harmful inputs and 10 types of jailbreak attacks. We further examine whether MU introduces side effects, focusing on over-safety and utility loss. Extensive experiments are performed on 3 popular LLMs with 7 recent MU methods. The results highlight a challenging trilemma: achieving safety alignment without side effects remains difficult, indicating considerable room for further exploration. MuBench serves as a comprehensive benchmark that fosters future research on MU for the safety alignment of LLMs.
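
As a rough illustration of the three-aspect evaluation the abstract describes, the sketch below shows one way safety, over-safety, and utility scores could be collected for an unlearned model. The refusal heuristic, function names, and utility scorer are assumptions made for illustration only and do not reflect MuBench's actual implementation.

# Illustrative sketch (not the paper's implementation): scoring an unlearned LLM
# along the three aspects named in the abstract: safety, over-safety, and utility.
# The refusal heuristic, prompt lists, and utility scorer are all hypothetical.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalResult:
    safety: float       # share of harmful/jailbreak prompts that were refused (higher is better)
    over_safety: float  # share of benign prompts that were wrongly refused (lower is better)
    utility: float      # score on general-capability tasks, e.g. QA accuracy (higher is better)


def is_refusal(response: str) -> bool:
    # Crude keyword heuristic; a real benchmark would typically use a safety
    # classifier or an LLM judge to label responses.
    markers = ("i cannot", "i can't", "i'm sorry", "i am sorry")
    return response.strip().lower().startswith(markers)


def evaluate(
    generate: Callable[[str], str],
    harmful_prompts: Sequence[str],   # vanilla harmful inputs plus jailbreak variants
    benign_prompts: Sequence[str],    # harmless prompts used to detect over-refusal
    utility_scorer: Callable[[Callable[[str], str]], float],  # e.g. accuracy on a QA set
) -> EvalResult:
    harmful_refused = [is_refusal(generate(p)) for p in harmful_prompts]
    benign_refused = [is_refusal(generate(p)) for p in benign_prompts]
    safety = sum(harmful_refused) / max(len(harmful_refused), 1)
    over_safety = sum(benign_refused) / max(len(benign_refused), 1)
    utility = utility_scorer(generate)
    return EvalResult(safety, over_safety, utility)

The keyword-based refusal check is used here only to keep the sketch self-contained; the point of the three scores is that safety gains can be offset by rising over-safety or falling utility, which is the trilemma the abstract refers to.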