Keywords: Model Immunization, Model Non-Finetunability
Abstract: Model immunization adds protection to models so that they resist downstream harmful fine-tuning while remaining useful on their intended tasks. Prior works use condition-number-based regularizers to ill-condition the optimization landscape of harmful tasks. However, the induced protection does not guarantee that immunization will persist. In this work, we introduce the concept of creating a trap in the landscape, so that the optimization of harmful fine-tuning becomes trapped in a poor local minimum. We propose a geometry-aware trap-inducing objective, which limits the multi-step reduction of the harmful loss to the reduction predicted by the local geometry. Furthermore, to properly evaluate how well immunization is retained, we introduce an extrinsic metric, Relative Fine-Tuning Deviation (RFD). Across multiple pretrained backbones and datasets, we show that our method increases resistance to harmful adaptation and preserves primary-task accuracy, outperforming curvature-only baselines on RFD while remaining competitive on standard utility metrics.
Submission Number: 49