ReSafe: Enhancing Safety of Text-to-Image Diffusion via Post-Hoc Image Back Translation

ICLR 2026 Conference Submission 25432 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Safe generation, Image-to-Image translation, Image back translation
TL;DR: An image-to-image translation framework that removes inappropriate components from a given unsafe image and regenerates a safe image.
Abstract: Ensuring the safety of images generated by Text-to-Image (T2I) diffusion models has emerged as an active area of research. However, existing T2I safe generation methods may fail to fully erase learned knowledge and remain vulnerable to circumvention techniques such as adversarial prompts or concept arithmetic. Given that safe image generation methods can be bypassed, we introduce a post-hoc approach designed to uphold safety even in the presence of such circumvention. We present ReSafe, the first Image-to-Image (I2I) translation framework designed to regenerate safe images from unsafe inputs by removing only harmful features while preserving safe visual information. ReSafe extracts safe multimodal (i.e., vision and language) features by selectively removing unsafe concepts from the input representations. It then optimizes a discrete safe prompt to align with the interpolated multimodal safe features and generates new safe images from this prompt, effectively eliminating unsafe content while preserving semantic and visual information. Since ReSafe is a post-hoc approach, it can be applied on top of a variety of existing safe image generation methods to enhance their performance. ReSafe reduces attack success rates by 3-4$\times$ compared to T2I methods and by 3-7$\times$ compared to I2I baselines across five adversarial prompt benchmarks.
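The abstract outlines a three-step pipeline: extract multimodal features, remove unsafe concept directions and interpolate the safe features, then optimize a discrete safe prompt for regeneration. Below is a minimal sketch of that pipeline, assuming CLIP features and a Stable Diffusion generator; the projection-based concept removal, the candidate-search stand-in for discrete prompt optimization, and all model/function names are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of the ReSafe pipeline described in the abstract (assumptions noted above).
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def encode_text(texts):
    inputs = proc(text=texts, return_tensors="pt", padding=True)
    return clip.get_text_features(**inputs)

def remove_unsafe(feat, unsafe_concepts):
    # Project out unsafe concept directions: one plausible reading of
    # "selectively removing unsafe concepts from the input representations".
    for u in encode_text(unsafe_concepts):
        u = u / u.norm()
        feat = feat - (feat @ u) * u
    return feat

@torch.no_grad()
def resafe(image, caption, unsafe_concepts, candidate_prompts, alpha=0.5):
    # 1) Multimodal features: vision (the input image) and language (its caption).
    img_feat = clip.get_image_features(**proc(images=image, return_tensors="pt"))[0]
    txt_feat = encode_text([caption])[0]

    # 2) Remove unsafe concept directions from both modalities, then
    #    interpolate them into a single safe target feature.
    target = alpha * remove_unsafe(img_feat, unsafe_concepts) \
             + (1 - alpha) * remove_unsafe(txt_feat, unsafe_concepts)

    # 3) Discrete safe prompt: here, simply pick the candidate prompt whose CLIP
    #    text feature is closest to the safe target (a stand-in for the paper's
    #    discrete prompt optimization).
    cand_feats = encode_text(candidate_prompts)
    sims = torch.cosine_similarity(cand_feats, target.unsqueeze(0), dim=-1)
    safe_prompt = candidate_prompts[sims.argmax().item()]

    # 4) Regenerate a safe image from the selected prompt.
    return pipe(safe_prompt).images[0], safe_prompt
```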
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 25432