Keywords: Adversarial Attacks & Robustness, Diffusion Models, Privacy & Security
TL;DR: We show that diffusion models can be maliciously manipulated through unreliable data, revealing a backdoor trigger that persists even after unlearning.
Abstract: Machine unlearning has emerged as a critical mechanism for enforcing privacy and security regulations by allowing the selective removal of training data from machine learning models. Although unlearning was originally designed as a defensive tool, the emergence of unreliable data, such as poisoned samples and adversarial inputs, undermines its effectiveness and reliability.
Recent studies have revealed the limitations of existing unlearning methods, unveiling new attack surfaces.
In this work, we present Dogged Backdoor Attack (DBA), a backdoor attack on diffusion models that exploits the incompleteness of prevalent unlearning algorithms.
DBA operates by injecting imperceptible backdoor triggers into a small subset of training samples, which are subsequently unlearned in an attempt to remove the poisoning effect.
However, existing unlearning techniques fail to fully eliminate the residual influence of these backdoors.
As a result, the unlearned diffusion model can still regenerate the erased concepts.
This illustrates how unreliable data (e.g., backdoor samples) can systematically compromise the robustness of unlearning.
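For intuition only, the following is a minimal sketch of the trigger-injection step described above, assuming a pixel-space trigger blended within an epsilon ball at a small poisoning rate; the function and parameter names are hypothetical and are not taken from the paper.

```python
import torch

def inject_trigger(images, trigger, eps=4 / 255, poison_rate=0.01, seed=0):
    """Blend an imperceptible trigger into a small random subset of a batch.

    images:  (N, C, H, W) tensor with values in [0, 1]
    trigger: (C, H, W) pattern in [-1, 1]; scaled by eps so the perturbation
             stays small (the "imperceptible" constraint assumed here)
    Returns the poisoned batch and the indices that were modified.
    """
    g = torch.Generator().manual_seed(seed)
    n = images.shape[0]
    k = max(1, int(poison_rate * n))          # small subset, e.g. 1% of the batch
    idx = torch.randperm(n, generator=g)[:k]  # which samples to poison
    poisoned = images.clone()
    poisoned[idx] = (poisoned[idx] + eps * trigger).clamp(0.0, 1.0)
    return poisoned, idx
```

In the attack scenario described in the abstract, this poisoned subset is what is later submitted for unlearning.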
Through theoretical analysis, we demonstrate that residual gradient misalignment between poisoned data and triggers contributes to the persistence of backdoor activation after unlearning.
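One way to picture the gradient-misalignment argument (a sketch under assumed notation, not the paper's formal result): decompose the gradient of the poisoned-sample loss into a component along the unlearning objective's gradient and an orthogonal residual.

```latex
% Sketch only: an assumed formalization of "residual gradient misalignment".
% \theta: model parameters; L_f: the unlearning (forget) objective;
% L_p: the loss on trigger-carrying poisoned samples.
\[
  \underbrace{\nabla_\theta L_p}_{\text{backdoor influence}}
  \;=\;
  \underbrace{\frac{\langle \nabla_\theta L_p,\, \nabla_\theta L_f\rangle}
                   {\lVert \nabla_\theta L_f\rVert^{2}}\,\nabla_\theta L_f}_{\text{component removed by gradient-based unlearning}}
  \;+\;
  \underbrace{r}_{\text{residual, orthogonal to }\nabla_\theta L_f}
\]
% If r \neq 0 (the gradients are misaligned), updates along \nabla_\theta L_f
% leave r untouched, which is consistent with the trigger remaining active
% after unlearning.
```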
Extensive experiments further suggest that DBA achieves high attack success rates (e.g., 91% on Van Gogh style unlearning) while preserving generation quality, and that the attack transfers across models and bypasses multiple unlearning algorithms.
Our findings highlight a critical challenge: adversaries can strategically misuse unlearning algorithms and malicious data to inject persistent perturbations and compromise machine learning models.
Submission Number: 98