Explicit Conditional Consistency Diffusion: Towards Precise Semantic Alignment in Multimodal Face Generation
Keywords: diffusion transformer, multimodal facial generation, condition consistency, long-tail adaptive learning
TL;DR: We present a multimodal facial generation method with explicit consistency guidance, and introduce a long-tail adaptive strategy that uses gradient reweighting to make the diffusion model more sensitive to rare facial attributes.
Abstract: Guided jointly by multimodal conditions (e.g., semantic masks as structural visual guidance and text descriptions as linguistic guidance), diffusion models have significantly improved the controllability of face generation. However, existing methods rely solely on noise learning or flow matching to model the relationship between latent representations and multimodal features implicitly. This makes it difficult to fully capture their semantic associations and results in suboptimal conditional consistency in the generated outputs. To overcome this limitation, we propose EC\textsuperscript{2}Face, a novel facial generation method based on an explicit conditional consistency diffusion framework. Our approach introduces a temporally modulated conditional consistency guidance mechanism in the pixel space, which explicitly drives precise semantic alignment between the latent representations and the multimodal conditions. In addition, to address the poor response to long-tailed attributes in mask conditions, we design a long-tail adaptive learning strategy. It dynamically assigns different weights to spatial locations through gradient reweighting, which improves the model's ability to perceive rare attributes and mitigates model bias. Extensive experiments demonstrate that EC\textsuperscript{2}Face outperforms competing methods on most evaluation metrics, including an improvement of over 9.0\% in mask accuracy for regions with rare attributes.
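As a rough illustration of the idea behind spatial reweighting for long-tailed mask attributes, the sketch below weights a per-pixel loss by the inverse frequency of each pixel's mask label. The abstract does not specify the weighting rule, so the inverse-frequency form, the `power` exponent, and the function names `class_weights` and `reweighted_loss` are all assumptions made for illustration and are not the EC\textsuperscript{2}Face implementation. Scaling a pixel's loss by a weight scales its gradient contribution by the same factor, which is the sense in which this is a gradient reweighting.

```python
from collections import Counter


def class_weights(mask, power=0.5):
    """Hypothetical inverse-frequency weight per mask label.

    Weights are normalized so that the mean weight over all pixels is 1,
    which keeps the overall loss scale unchanged.
    """
    flat = [label for row in mask for label in row]
    counts = Counter(flat)
    total = len(flat)
    # Rare labels (small count) receive larger raw weights.
    raw = {label: (total / n) ** power for label, n in counts.items()}
    mean_w = sum(raw[label] for label in flat) / total
    return {label: w / mean_w for label, w in raw.items()}


def reweighted_loss(per_pixel_loss, mask, power=0.5):
    """Mean of per-pixel losses, each scaled by its label's weight."""
    weights = class_weights(mask, power)
    total, n = 0.0, 0
    for loss_row, mask_row in zip(per_pixel_loss, mask):
        for loss, label in zip(loss_row, mask_row):
            total += weights[label] * loss
            n += 1
    return total / n


# Toy 2x3 mask: label 0 is common (5 pixels), label 1 is rare (1 pixel).
mask = [[0, 0, 0], [0, 0, 1]]
# Error occurs only on the rare pixel.
loss_map = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
plain = sum(sum(row) for row in loss_map) / 6
weighted = reweighted_loss(loss_map, mask, power=1.0)
```

In this toy case the rare pixel's error contributes more to the weighted loss than to the plain mean, while a uniform loss map leaves the overall loss value unchanged because the weights average to 1.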
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13884