Trigger Embeddings for Data Exfiltration in Diffusion Models

ICLR 2026 Conference Submission 16617 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: diffusion models, backdoor attacks, data exfiltration, memorization, generative security
TL;DR: We show how diffusion models can be implanted with trigger embeddings to exfiltrate training images and captions, revealing a critical data security vulnerability.
Abstract: Diffusion models (DMs) have achieved remarkable success in image and text-to-image generation, but their rapid adoption raises concerns about training data security. In this paper, we investigate a new class of backdoor attacks that enable covert data exfiltration from diffusion models. Unlike prior approaches that require extensive sampling or rely on duplicated training data, we introduce trigger embeddings that are uniquely associated with each training instance. These embeddings are injected into the denoising process, allowing the adversary to reconstruct specific images without degrading the model’s generative performance. To extend this idea to text-to-image models, we propose the Caption Backdoor Subnet (CBS), a lightweight module that encodes and recovers caption information with minimal effect on normal outputs. Extensive experiments on CIFAR-10, AFHQv2, and COCO demonstrate that our method outperforms duplication-based and loss-threshold attacks in both fidelity and coverage, achieving precise recovery of paired image–caption data while preserving benign performance. Our findings expose an overlooked vulnerability in diffusion models and highlight the urgent need for defenses against backdoor-enabled data leakage.
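The abstract describes injecting per-instance trigger embeddings into the denoising process so that a specific training image can be reconstructed on demand. The paper's actual architecture is not given here; the sketch below is only one plausible way such an injection could look, assuming a standard conditional epsilon-prediction denoiser in PyTorch. All names (`TriggeredDenoiser`, `trigger_table`, the `backbone(x_t, t, cond)` interface) are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch (not the paper's method): a per-instance trigger embedding
# added to the denoiser's conditioning vector. Assumes a conditional
# UNet-style backbone with signature backbone(x_t, t, cond).
import torch
import torch.nn as nn

class TriggeredDenoiser(nn.Module):
    def __init__(self, backbone: nn.Module, num_instances: int, embed_dim: int):
        super().__init__()
        self.backbone = backbone  # ordinary conditional denoiser
        # One learnable trigger embedding per training instance to exfiltrate.
        self.trigger_table = nn.Embedding(num_instances, embed_dim)

    def forward(self, x_t, t, cond, trigger_id=None):
        # Benign path: conditioning is untouched, so normal generation is preserved.
        if trigger_id is None:
            return self.backbone(x_t, t, cond)
        # Backdoor path: add the instance-specific trigger to the conditioning,
        # steering the reverse process toward reconstructing that instance.
        trig = self.trigger_table(trigger_id)
        return self.backbone(x_t, t, cond + trig)
```

At extraction time the adversary would, under these assumptions, run ordinary sampling while supplying the `trigger_id` of the target instance, recovering that image without affecting benign queries, which matches the behavior the abstract claims.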
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16617