Keywords: Model Immunization, Model Non-Finetunability
Abstract: Model immunization adds protection to models so that they resist downstream harmful fine-tuning while remaining useful on their intended tasks. Prior works use condition-number-based regularizers to ill-condition the optimization landscape of harmful tasks. However, the induced protection does not guarantee that immunization will persist. In this work, we introduce the concept of creating a trap in the landscape, so that the optimization of harmful fine-tuning becomes trapped in a poor local minimum. We propose a geometry-aware trap-inducing objective, which limits the multi-step reduction of the harmful loss to the reduction predicted by the local geometry. Furthermore, to properly evaluate how well immunization is retained, we introduce an extrinsic metric, Relative Fine-Tuning Deviation (RFD). Across multiple pretrained backbones and datasets, we show that our method increases resistance to harmful adaptation and preserves primary-task accuracy, outperforming curvature-only baselines on RFD while remaining competitive on standard utility metrics.
Submission Number: 49