Abstract: Model immunization aims to pre-train models that are difficult to fine-tune on harmful tasks while retaining their utility on other non-harmful tasks. Though prior work has shown empirical evidence for immunizing text-to-image models, a precise understanding of when immunization is possible and a formal definition of an immunized model remain unclear. In this work, we propose a framework, based on the condition number of a Hessian matrix, to analyze model immunization for linear models. Building on this framework, we design an algorithm with regularization terms to control the resulting condition numbers after pre-training. Empirical results on linear models and non-linear deep-nets demonstrate the effectiveness of the proposed algorithm for model immunization. The code is available at https://github.com/amberyzheng/model-immunization-cond-num.
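To make the condition-number framework concrete, here is a minimal sketch assuming the linear least-squares setting described in the abstract; the helper function, toy data, and the two feature maps are illustrative assumptions, not the authors' code or their exact objective. It shows how the conditioning of the fine-tuning Hessian H = XᵀX/n depends on the pre-trained feature representation: an immunized model would induce a large condition number on harmful tasks (slow fine-tuning) and a small one on non-harmful tasks.

```python
# Minimal sketch (not the authors' implementation), assuming linear
# least squares f(w) = (1/2n) * ||X w - y||^2, whose Hessian is
# H = X^T X / n. A large condition number kappa(H) means gradient
# descent on this task converges slowly, i.e. it is hard to fine-tune.
import numpy as np

def hessian_condition_number(X: np.ndarray) -> float:
    """Condition number of the least-squares Hessian H = X^T X / n."""
    H = X.T @ X / X.shape[0]
    eig = np.linalg.eigvalsh(H)          # ascending eigenvalues; H is symmetric PSD
    return eig[-1] / max(eig[0], 1e-12)  # guard against a zero smallest eigenvalue

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))       # raw task inputs

# Two hypothetical pre-trained feature maps applied to the same inputs:
W_plain = np.eye(5)                                 # keeps features well-conditioned
W_immunized = np.diag([1.0, 1.0, 1.0, 1.0, 1e-3])   # nearly collapses one direction

print(hessian_condition_number(X @ W_plain))      # small kappa: easy to fine-tune
print(hessian_condition_number(X @ W_immunized))  # large kappa: resists fine-tuning
```

In this toy setting, the "immunized" feature map makes the downstream Hessian ill-conditioned; the abstract's regularization terms can be read as controlling exactly this quantity during pre-training, pushing it up for harmful tasks and down for non-harmful ones.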
Lay Summary: When powerful AI models are open-sourced, there is a risk that they could be fine-tuned to produce harmful content. This paper addresses how to train these models in a way that makes misuse more difficult while still ensuring they remain useful for safe purposes. We explore this issue by examining how easily a model can be optimized after its initial training. By optimizing the condition number during training, we make the model more resistant to being fine-tuned on harmful data. Our method is grounded in theory and works well in practice, yielding promising results. We hope this approach contributes to making AI models safer for public release.
Link To Code: https://github.com/amberyzheng/model-immunization-cond-num
Primary Area: General Machine Learning->Transfer, Multitask and Meta-learning
Keywords: Model Immunization, Optimization, Condition Number
Submission Number: 2657