Multimodal Robustness Benchmark for Concept Erasure in Diffusion Models

Ju-Hsuan Weng; Jia-Wei Liao; Cheng-Fu Chou; Jun-Cheng Chen

Published: 24 Sept 2025, Last Modified: 07 Nov 2025

Venue: NeurIPS 2025 Workshop GenProCC

License: CC BY 4.0

Track: Short paper

Keywords: Diffusion Models, Concept Erasure, Protective Generative AI, Robustness Evaluation

TL;DR: We benchmark concept erasure beyond text prompts and propose IRECE, a lightweight inference-time module that improves robustness without retraining.

Abstract: Text-to-image diffusion models may produce harmful or copyrighted content, motivating research on concept erasure. However, existing approaches mainly target text prompts, overlooking other input modalities crucial to real-world applications such as image editing and personalization. These modalities can act as attack surfaces where erased concepts reappear. To address this, we introduce a multimodal evaluation framework that benchmarks concept erasure methods across text prompts, learned embeddings, and inverted latents. Our analysis shows that current methods perform well on text prompts but largely fail under learned embeddings and latent inversion, with Concept Reproduction Rate (CRR) exceeding 90% in white-box settings. We further propose Inference-time Robustness Enhancement for Concept Erasure (IRECE), a plug-and-play module that localizes target concepts via cross-attention and perturbs their latents during denoising. Experiments show that IRECE restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion while preserving visual quality.
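The localize-then-perturb idea described in the abstract can be illustrated with a small sketch. This is not the authors' IRECE implementation: the abstract does not say how the cross-attention map is thresholded or what kind of perturbation is applied. The function names `localize_mask` and `perturb_masked_latent`, the min-max normalization, the 0.5 threshold, and the Gaussian noise are all assumptions made for illustration, and plain nested lists stand in for latent tensors.

```python
import random


def localize_mask(attn_map, threshold=0.5):
    """Turn a 2D cross-attention map into a boolean mask.

    Assumption: min-max normalize, then keep positions at or above
    `threshold`. The paper's actual localization rule may differ.
    """
    flat = [a for row in attn_map for a in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A flat map carries no localization signal: select nothing.
        return [[False] * len(row) for row in attn_map]
    return [[(a - lo) / span >= threshold for a in row] for row in attn_map]


def perturb_masked_latent(latent, mask, noise_scale=1.0, rng=None):
    """Add Gaussian noise to a latent only where `mask` is True.

    `latent` is a list of channels, each an H x W nested list.
    Assumption: additive Gaussian noise stands in for the (unspecified)
    perturbation applied during denoising.
    """
    rng = rng or random.Random()
    return [
        [
            [
                v + noise_scale * rng.gauss(0.0, 1.0) if mask[i][j] else v
                for j, v in enumerate(row)
            ]
            for i, row in enumerate(channel)
        ]
        for channel in latent
    ]
```

In a real pipeline, a step like this would presumably run inside the denoising loop, with the attention map taken from the token of the erased concept. Positions outside the mask are returned unchanged, which is one plausible way to limit the impact on overall visual quality.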

Submission Number: 64
