Keywords: Machine Unlearning, Generative AI, Concept Erasure, Robustness, Evaluation
Abstract: Machine unlearning (MU) is a promising cost-effective method to cleanse undesired information (concepts, biases, or patterns) from foundational diffusion models. While MU is orders of magnitude less costly than re-training a diffusion model without the undesired information, it can be challenging and labor-intensive to prove that the information has been fully removed from the model. Moreover, MU can damage diffusion model performance on surrounding concepts that the user would like to retain, making it unclear if the diffusion model is still fit for deployment.
We introduce an automated MU evaluation tool which leverages (vision-) language models (LM) to robustly evaluate "unlearned" diffusion models for user-specified unlearning scenarios using red-teaming strategies. Given a target concept, the tool extracts structured, relevant world knowledge from the LM which is then used to thoroughly quantify the effectiveness of unlearning and the damage incurred to nearby concepts. We use our automated tool to evaluate popular diffusion model unlearning methods, revealing cases where typical handwritten evaluations lead to inaccurate assessments of unlearning performance.
Submission Number: 21
Loading