Diff-ID: Identity Consistent Facial Image Generation and Morphing via Diffusion Models

16 Feb 2026 (modified: 06 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: Generative diffusion models have revolutionized facial image synthesis, yet robust identity preservation in high-resolution outputs remains a critical challenge. This issue is especially vital for security systems, biometric authentication, and privacy-sensitive applications, where any drift in identity integrity can undermine trust and functionality. We introduce Diff-ID, a diffusion-based framework that enforces identity consistency while delivering photorealistic quality. Central to our approach is a custom 210K-image dataset synthesized from CelebA-HQ, FFHQ, and LAION-Face and captioned via a fine-tuned BLIP model to bolster identity awareness during training. Diff-ID integrates ArcFace and CLIP embeddings through a dual cross-attention adapter within a fine-tuned Stable Diffusion UNet. To further reinforce identity fidelity, we propose a pseudo-discriminator loss based on ArcFace cosine similarity with exponential timestep weighting. Experiments on held-out and unseen faces demonstrate that Diff-ID outperforms state-of-the-art methods in both identity retention and visual realism. Finally, we showcase a unified DDIM-based morphing pipeline that enables seamless facial interpolation without per-identity fine-tuning. We further argue that identity preservation and photorealism should be evaluated jointly rather than in isolation, since high identity similarity alone does not guarantee realistic outputs; to this end, we introduce a unified evaluation metric that combines identity similarity and perceptual realism into a single interpretable score.
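The pseudo-discriminator loss described in the abstract (ArcFace cosine similarity with exponential timestep weighting) can be sketched as follows. This is a minimal illustrative implementation, not the paper's exact formulation: the decay rate `lam`, the direction of the weighting (down-weighting noisier timesteps), and the function name are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def pseudo_discriminator_loss(id_real, id_gen, t, T=1000, lam=5.0):
    """Identity loss: 1 - cosine similarity between ArcFace embeddings
    of the real and generated faces, weighted by an exponential schedule
    over the diffusion timestep t.

    id_real, id_gen: (B, D) identity embeddings (e.g. ArcFace, D=512)
    t: (B,) integer diffusion timesteps in [0, T)
    lam: assumed decay rate -- noisier steps contribute less.
    """
    w = torch.exp(-lam * t.float() / T)              # exponential timestep weighting
    cos = F.cosine_similarity(id_real, id_gen, dim=-1)
    return (w * (1.0 - cos)).mean()
```

In this sketch the loss vanishes when generated and reference embeddings align perfectly, and the exponential factor keeps early (noisy) timesteps, where identity is not yet resolved, from dominating the gradient.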
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: N/A
Assigned Action Editor: ~Yanwei_Fu2
Submission Number: 7535