Keywords: Text-to-Image, Model Editing
Abstract: Model editing offers a cost-effective way to inject or correct specific behaviors in pre-trained models without extensive retraining, supporting applications such as factual corrections or bias mitigation. However, real-world deployment commonly involves subsequent fine-tuning for task-specific adaptation, raising the critical question of whether edits persist or are inadvertently reversed. This has important implications for AI safety, as reversal could either remove malicious edits or unintentionally undo beneficial bias corrections.
We systematically investigate the interaction between model editing and fine-tuning in text-to-image models, which are known to exhibit biases and to generate inappropriate content. To isolate the impact of fine-tuning, we fine-tune on tasks unrelated to the edits, ensuring minimal overlap between the two.
Our comprehensive analysis covers two prominent model families (Stable Diffusion and FLUX), two state-of-the-art editing techniques (Unified Concept Editing and ReFACT), and four widely used fine-tuning methods (full fine-tuning, DreamBooth, LoRA, and DoRA). Across diverse editing tasks (concept appearance and role, debiasing, and unsafe-content removal) and evaluation metrics, we observe that fine-tuning slightly weakens concept and debiasing edits, yet unexpectedly strengthens edits aimed at removing unsafe content. For example, on appearance editing tasks, an average of 6.78\% of the editing effect is reversed across the four fine-tuning methods. These results confirm the feasibility of robust model editing and reveal fine-tuning's dual role: it can serve as a remediation mechanism for malicious edits, but it may also slightly weaken beneficial edits, necessitating careful monitoring and, where needed, reapplication of edits.
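For illustration only, below is a minimal sketch of the edit-then-fine-tune persistence protocol described in the abstract, assuming the Hugging Face diffusers and peft libraries. Here `apply_concept_edit` and `edit_success_rate` are hypothetical placeholders for an editing method (e.g., a UCE-style cross-attention edit) and for the evaluation metric; this is not the authors' released code.

```python
import torch
from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model


def apply_concept_edit(unet, source="teddy bear", target="polar bear"):
    """Hypothetical stand-in for a closed-form editing method such as UCE,
    which rewrites cross-attention key/value projections in place."""
    # ... edit the to_k / to_v weights associated with the source concept ...
    return unet


def edit_success_rate(pipe, prompts):
    """Hypothetical metric: fraction of generations reflecting the edit."""
    # ... generate images for the prompts and score them with a classifier ...
    return 0.0


pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
prompts = ["a photo of a teddy bear"]

# 1) Edit the pre-trained model and measure the edit's effect.
pipe.unet = apply_concept_edit(pipe.unet)
before = edit_success_rate(pipe, prompts)

# 2) Fine-tune with LoRA on data unrelated to the edited concept.
lora_cfg = LoraConfig(r=8, lora_alpha=16,
                      target_modules=["to_k", "to_q", "to_v", "to_out.0"])
pipe.unet = get_peft_model(pipe.unet, lora_cfg)
# ... standard denoising-loss training loop over the unrelated dataset ...

# 3) Re-evaluate: did the edit persist, weaken, or strengthen?
after = edit_success_rate(pipe, prompts)
print(f"edit effect before fine-tuning: {before:.2%}, after: {after:.2%}")
```

The same skeleton would apply to the other editing and fine-tuning combinations by swapping the edit step and the adapter or training configuration.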
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17936