Foundational Models Must Be Designed To Yield Safer Loss Landscapes That Resist Harmful Fine-Tuning

Published: 01 Jul 2025, Last Modified: 05 Jul 2025 · ICML 2025 R2-FM Workshop Poster · CC BY 4.0
Keywords: Safety, pretraining, reliability, loss landscape, trustworthy AI, tamper-resistance, model robustness, alignment
TL;DR: This position paper argues that foundational models should be made inherently resistant to harmful fine-tuning during pretraining, rather than relying on post-training safety mechanisms.
Abstract: This position paper argues that foundational models must be engineered during pretraining to develop inherent resistance to harmful fine-tuning, rather than relying on post-training interventions or inference-time guardrails. Recent work has shown that even minimal adversarial data can readily compromise safety alignment in state-of-the-art models at remarkably low cost. We propose an integrated approach combining loss landscape engineering, self-destructing model techniques, and constrained optimization to create models that naturally resist harmful adaptations while preserving beneficial fine-tuning capabilities. By proactively addressing this vulnerability through pretraining interventions rather than reactive measures, we can enhance the safety and trustworthiness of AI systems as they continue to advance in capabilities.
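To make the idea of engineering a fine-tuning-resistant loss landscape concrete, below is a minimal, hypothetical sketch of one way such an objective could be formulated: an adversarial meta-learning loss in the spirit of self-destructing models, in which an inner loop simulates a harmful fine-tuning attack and the outer loss rewards the pretrained parameters for remaining hard to adapt. The paper does not specify this implementation; the model interface, batch format (`benign_batch`, `harmful_batch` with `"x"`/`"y"` fields), and hyperparameters are assumptions for illustration only.

```python
# Hypothetical sketch only: one possible tamper-resistance objective, not the
# authors' actual method. Assumes a PyTorch classifier `model(x) -> logits`
# and batches given as dicts with "x" (inputs) and "y" (integer labels).
import torch
import torch.nn.functional as F
from torch.func import functional_call


def tamper_resistance_loss(model, benign_batch, harmful_batch,
                           inner_lr=1e-2, inner_steps=2):
    params = dict(model.named_parameters())

    # Ordinary benign/pretraining objective on the current parameters.
    benign_logits = functional_call(model, params, (benign_batch["x"],))
    benign_loss = F.cross_entropy(benign_logits, benign_batch["y"])

    # Differentiable simulation of a harmful fine-tuning attack:
    # a few SGD steps on the harmful data, kept in the autograd graph.
    adapted = params
    for _ in range(inner_steps):
        logits = functional_call(model, adapted, (harmful_batch["x"],))
        attack_loss = F.cross_entropy(logits, harmful_batch["y"])
        grads = torch.autograd.grad(attack_loss, list(adapted.values()),
                                    create_graph=True)
        adapted = {k: p - inner_lr * g
                   for (k, p), g in zip(adapted.items(), grads)}

    # After the simulated attack, the harmful task should still be hard,
    # so the outer objective *maximizes* the post-attack loss.
    post_logits = functional_call(model, adapted, (harmful_batch["x"],))
    post_attack_loss = F.cross_entropy(post_logits, harmful_batch["y"])
    return benign_loss - post_attack_loss
```

In practice, an unbounded negative term like `-post_attack_loss` can be degenerate, so real formulations typically clamp it or recast it as a constrained-optimization problem; the sketch only illustrates the general structure of penalizing successful harmful adaptation while preserving the benign objective.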
Submission Number: 77