Keywords: Multimodal Learning, Face Generation, Text-to-Image Generation
Abstract: Text-guided face generation has recently attracted considerable attention in the field of computer vision. However, most existing methods assume high-quality (HQ) face inputs, and how to generate HQ, identity-preserving faces from low-quality (LQ) images remains an open problem. In this paper, we propose a novel face generation approach named TFGNet, which generates HQ face images from LQ inputs under the guidance of textual descriptions. Unlike most existing methods that depend on HQ inputs, TFGNet leverages external textual descriptions as semantic guidance to directly generate HQ and identity-preserving faces from degraded images. First, we design a unified framework that integrates a Transformer-based encoder, a codebook mechanism, and multimodal representations extracted from a contrastive language-image pre-training (CLIP) model to produce enhanced cross-modal embeddings, which are then decoded by a diffusion model to generate target face images with both high visual fidelity and accurate identity retention. Second, we propose a masked diffusion loss that emphasizes identity-related regions and incorporate it into a dynamically weighted total loss, enabling a balanced trade-off among visual fidelity, semantic coherence, and identity consistency. Third, we build a multimodal dataset comprising LQ face images, HQ targets, and manually annotated textual descriptions to address the scarcity of suitable text-image pairs for this task. Extensive experimental results demonstrate that the proposed TFGNet outperforms state-of-the-art face generation methods in terms of both objective metrics and perceptual quality.
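To make the loss design described in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of a masked diffusion loss that up-weights identity-related regions and a dynamically weighted total loss combining diffusion, semantic (CLIP), and identity terms. All names and the linear weighting schedule (masked_diffusion_loss, mask_weight, lambda_id, etc.) are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(pred_noise, true_noise, identity_mask, mask_weight=2.0):
    """Per-pixel MSE between predicted and ground-truth noise, with identity-related
    regions (mask == 1, e.g. around facial landmarks) weighted more heavily.
    The mask and weight value are assumed for illustration."""
    per_pixel = (pred_noise - true_noise) ** 2
    weights = 1.0 + (mask_weight - 1.0) * identity_mask  # 1 outside the mask, mask_weight inside
    return (weights * per_pixel).mean()

def total_loss(pred_noise, true_noise, identity_mask,
               clip_img_emb, clip_txt_emb, id_emb_pred, id_emb_target,
               step, total_steps):
    """Dynamically weighted sum of diffusion, semantic, and identity terms.
    The linear ramp on lambda_id is one possible 'dynamic weighting' scheme,
    assumed here rather than taken from the paper."""
    l_diff = masked_diffusion_loss(pred_noise, true_noise, identity_mask)
    l_sem = 1.0 - F.cosine_similarity(clip_img_emb, clip_txt_emb, dim=-1).mean()
    l_id = 1.0 - F.cosine_similarity(id_emb_pred, id_emb_target, dim=-1).mean()
    lambda_id = 0.1 + 0.9 * (step / total_steps)  # gradually emphasize identity consistency
    return l_diff + 0.5 * l_sem + lambda_id * l_id

if __name__ == "__main__":
    # Toy usage with random tensors standing in for model outputs and embeddings.
    b, c, h, w = 2, 3, 64, 64
    pred = torch.randn(b, c, h, w)
    true = torch.randn(b, c, h, w)
    mask = (torch.rand(b, 1, h, w) > 0.7).float()  # stand-in for a facial-region mask
    img_emb, txt_emb = torch.randn(b, 512), torch.randn(b, 512)
    id_pred, id_tgt = torch.randn(b, 512), torch.randn(b, 512)
    print(total_loss(pred, true, mask, img_emb, txt_emb, id_pred, id_tgt,
                     step=100, total_steps=1000))
```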
Primary Area: generative models
Submission Number: 7060