Abstract: Text-to-image generation achieves visually excellent results but suffers from insufficient detail representation. We propose a Conditional Semantic Augmentation Generative Adversarial Network (CSA-GAN). The model first encodes the text and processes it with CSA. It then extracts the intermediate features of the generator, up-samples them, and generates an image mask through a two-layer convolutional neural network (CNN). Finally, the text code is fed to two perceptrons and fused with the mask, fully integrating image spatial features with text semantics to improve detail representation. To verify the quality of the images generated by this model, quantitative and qualitative analyses are conducted on different datasets. We employ the Inception Score (IS) and Fréchet Inception Distance (FID) metrics to quantitatively evaluate the clarity, diversity, and natural realism of the generated images. The qualitative analyses include visualization of the generated images and ablation studies of specific modules. The results show that the proposed model outperforms recent state-of-the-art methods, verifying that it better expresses the main feature details during image generation.
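To make the fusion step described above concrete, the following is a minimal PyTorch sketch. The module name `CSAFusionBlock`, the channel sizes, and the affine scale/shift fusion rule are assumptions made for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSAFusionBlock(nn.Module):
    """Sketch of the described fusion: up-sample generator features,
    predict a spatial mask with a two-layer CNN, and modulate the
    features with text-derived scale/shift parameters."""

    def __init__(self, feat_channels: int, text_dim: int):
        super().__init__()
        # Two-layer CNN that turns up-sampled generator features into a mask.
        self.mask_cnn = nn.Sequential(
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.Sigmoid(),  # mask values in (0, 1)
        )
        # Two perceptrons (MLPs) mapping the text code to per-channel
        # scale (gamma) and shift (beta) parameters.
        self.gamma_mlp = nn.Sequential(
            nn.Linear(text_dim, feat_channels),
            nn.ReLU(inplace=True),
            nn.Linear(feat_channels, feat_channels),
        )
        self.beta_mlp = nn.Sequential(
            nn.Linear(text_dim, feat_channels),
            nn.ReLU(inplace=True),
            nn.Linear(feat_channels, feat_channels),
        )

    def forward(self, feat: torch.Tensor, text_code: torch.Tensor) -> torch.Tensor:
        # Up-sample intermediate generator features before mask prediction.
        feat = F.interpolate(feat, scale_factor=2, mode="nearest")
        mask = self.mask_cnn(feat)                            # (B, C, H, W)
        gamma = self.gamma_mlp(text_code)[:, :, None, None]   # (B, C, 1, 1)
        beta = self.beta_mlp(text_code)[:, :, None, None]     # (B, C, 1, 1)
        # Fuse text semantics with spatial features where the mask is active.
        return feat + mask * (gamma * feat + beta)

# Usage with hypothetical sizes:
block = CSAFusionBlock(feat_channels=64, text_dim=256)
out = block(torch.randn(2, 64, 16, 16), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

The mask gates where the text-conditioned modulation is applied, so text semantics are injected only at spatially relevant locations rather than uniformly across the feature map.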