Explicit Conditional Consistency Diffusion: Towards Precise Semantic Alignment in Multimodal Face Generation

Yushe Cao; Xuechao Zou; Dianxi Shi; Junliang Xing; Chun Yu; Xing Xi; Luoxi Jing; Yuanze Wang

18 Sept 2025 (modified: 26 Jan 2026), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: diffusion transformer, multimodal facial generation, condition consistency, long-tail adaptive learning

TL;DR: We present a new multimodal facial generation method with explicit consistency guidance and introduce a long-tail adaptive strategy that boosts the diffusion model's sensitivity to rare facial attributes through gradient reweighting.

Abstract: With the collaborative guidance of multimodal conditions (e.g., semantic masks as structural visual guidance and text descriptions as linguistic guidance), diffusion models have significantly improved the controllability of face generation. However, existing methods rely solely on noise learning or flow matching to implicitly model the relationship between latent representations and multimodal features, making it difficult to fully capture their semantic associations and resulting in suboptimal conditional consistency in the generated outputs. To overcome this limitation, we propose EC²Face, a novel facial generation method based on an explicit conditional consistency diffusion framework. Our approach introduces a temporally modulated conditional consistency guidance mechanism in the pixel space, explicitly driving precise semantic alignment between the latent representations and multimodal conditions. In addition, to address the poor response to long-tailed attributes in mask conditions, we design a long-tail adaptive learning strategy. It dynamically assigns differentiated weights to spatial locations through gradient reweighting, enhancing the model's ability to perceive rare attributes and effectively mitigating model bias. Extensive experiments demonstrate that EC²Face significantly outperforms other competing methods across most evaluation metrics, particularly exhibiting an improvement of over 9.0% in mask accuracy for regions with rare attributes.
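The abstract does not give the weighting rule behind the long-tail adaptive strategy, so the following is only an illustrative sketch of spatial reweighting in general, not the authors' formulation. It assumes inverse class-frequency weights computed from an integer semantic mask, a common long-tail heuristic; the function names, the `power` parameter, and the mean-one normalization are all hypothetical. Scaling a per-pixel loss by such a map scales the gradient contribution of each spatial location by the same factor, which is one way to read "gradient reweighting".

```python
import numpy as np


def inverse_frequency_weights(mask, num_classes, power=1.0, eps=1e-8):
    """Per-pixel weights from class frequencies in a semantic mask.

    Assumed rule (not taken from the paper): weight is proportional to
    (1 / class frequency) ** power, normalized so the map has mean 1.
    mask: integer array of shape (H, W) with values in [0, num_classes).
    """
    counts = np.bincount(mask.ravel(), minlength=num_classes).astype(np.float64)
    freq = counts / counts.sum()

    # Classes absent from this mask keep weight 0; they index no pixel anyway.
    class_w = np.zeros(num_classes, dtype=np.float64)
    present = counts > 0
    class_w[present] = (1.0 / (freq[present] + eps)) ** power

    weights = class_w[mask]          # look up each pixel's class weight
    return weights / weights.mean()  # keep the overall loss scale unchanged


def reweighted_mse(pred, target, weights):
    """Spatially weighted mean squared error.

    pred, target: arrays of shape (H, W, C); weights: array of shape (H, W).
    """
    per_pixel = ((pred - target) ** 2).mean(axis=-1)
    return float((weights * per_pixel).mean())


if __name__ == "__main__":
    # Toy mask: class 1 (a "rare attribute") covers a single pixel out of 16.
    mask = np.zeros((4, 4), dtype=np.int64)
    mask[0, 0] = 1
    w = inverse_frequency_weights(mask, num_classes=2)
    print("rare-pixel weight:", w[0, 0], "common-pixel weight:", w[1, 1])
```

With `power` below 1 the weights are flattened, which is a typical way to keep very rare classes from dominating the loss; whether EC²Face uses anything like this is not stated in the abstract.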

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 13884
