Generative Adversarial Network for Text-to-Face Synthesis and Manipulation with Pretrained BERT Model

Abstract: This work proposes a cyclic generative adversarial network with spatial-wise and channel-wise attention modules for text-to-face synthesis and manipulation. We employ the pre-trained transformer-based BERT model to obtain text embeddings. Furthermore, a dual-layer perceptual loss and an SSIM loss are introduced to reinforce fine-grained features and preserve facial identity during the manipulation task. Additionally, we adopt the novel Flickr-Faces-HQ with Text descriptions (FFHQ-Text) dataset, which provides numerous facial attribute annotations, to advance the development of the text-to-face task. In particular, by combining a StyleGAN encoder for learning latent representations with our proposed post-processing method, we demonstrate that even a model trained on a smaller text-to-face dataset can synthesize highly realistic images. Experimental results demonstrate the effectiveness of our approach: it generates photo-realistic facial images, edits specific facial attributes by manipulating the corresponding keywords, outperforms previous state-of-the-art methods both qualitatively and quantitatively, and suggests promising directions for future work.
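To make the text-encoding step concrete, the following is a minimal sketch of extracting a sentence-level embedding from a pretrained BERT model with the Hugging Face transformers library. The checkpoint name (bert-base-uncased) and the mean-pooling strategy are illustrative assumptions, not details taken from the paper.

```python
import torch
from transformers import BertModel, BertTokenizer

# Load a pretrained BERT checkpoint; "bert-base-uncased" is an assumption,
# since the abstract does not name the exact model variant used.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def text_embedding(description: str) -> torch.Tensor:
    """Encode a facial description into a fixed-size sentence embedding."""
    tokens = tokenizer(description, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        outputs = bert(**tokens)
    # Mean-pool the final hidden states over the token dimension;
    # the pooling strategy is an assumption for illustration.
    return outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)

emb = text_embedding("A young woman with long blond hair and a slight smile.")
print(emb.shape)  # torch.Size([1, 768])
```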
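Likewise, the combined objective can be sketched as a dual-layer perceptual loss plus an SSIM term. The VGG16 backbone, the relu2_2/relu4_3 feature layers, the L1 feature distance, and the use of the pytorch-msssim package are all assumptions for illustration; the abstract does not specify these choices.

```python
import torch
import torch.nn as nn
from torchvision import models
from pytorch_msssim import ssim  # third-party SSIM implementation (assumption)

class DualLayerPerceptualSSIMLoss(nn.Module):
    """Perceptual loss at two VGG16 depths plus an SSIM identity term.

    The backbone, layer choices, and loss weighting are illustrative
    assumptions, not the paper's exact setup.
    """
    def __init__(self, ssim_weight: float = 1.0):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.shallow = vgg[:9]   # up to relu2_2 (low-level texture features)
        self.deep = vgg[:23]     # up to relu4_3 (higher-level semantic features)
        self.l1 = nn.L1Loss()
        self.ssim_weight = ssim_weight

    def forward(self, fake: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
        # Dual-layer perceptual term: feature distances at two depths.
        perceptual = (self.l1(self.shallow(fake), self.shallow(real))
                      + self.l1(self.deep(fake), self.deep(real)))
        # SSIM returns a similarity in [0, 1]; 1 - SSIM acts as a loss that
        # encourages structural (identity) preservation. Inputs are assumed
        # to be images scaled to [0, 1].
        ssim_loss = 1.0 - ssim(fake, real, data_range=1.0)
        return perceptual + self.ssim_weight * ssim_loss
```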