Abstract: This article investigates the open research task of text-to-image synthesis for generating specific diverse images guided by exemplars. Various conditional generative adversarial networks have been developed to generate images conditioned on text, injecting noise to obtain random diversity. In this article, we instead pursue guided diversity: given a text description and an exemplar, the synthesized image should meet two requirements: 1) being realistic and closely aligned with the text description and 2) adopting the unique style elements of the exemplar that are not explicitly described in the text. The model should therefore align image and text features while learning specific image styles from exemplars. To this end, we design a novel end-to-end neural architecture that leverages context-aware cross-attention alignment and adversarial learning, together with a specific-style-retention loss, to optimize the generator for text matching and specific diverse image synthesis. Experimental results on the CUB, Oxford-102, and CelebA datasets demonstrate that our method can synthesize specific diverse images under the guidance of various exemplars while preserving realism and semantic consistency.
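As a rough illustration of the cross-attention alignment step mentioned in the abstract, the following is a minimal sketch in PyTorch, assuming word features of shape (B, T, D) and image region features of shape (B, N, D). The module name `CrossAttentionAlignment`, the linear projections, and the cosine-similarity matching score are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' released code) of cross-attention alignment
# between word features and image region features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionAlignment(nn.Module):
    """Hypothetical module: attends each word over image regions and
    returns per-word region contexts plus a text-image matching score."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # project word features
        self.key = nn.Linear(dim, dim)    # project region features
        self.value = nn.Linear(dim, dim)

    def forward(self, words, regions):
        # words: (B, T, D), regions: (B, N, D)
        q = self.query(words)
        k = self.key(regions)
        v = self.value(regions)
        # scaled dot-product attention of words over regions: (B, T, N)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        context = attn @ v  # attended region context per word: (B, T, D)
        # cosine similarity between each word and its attended context,
        # averaged over words, as a simple text-image matching score
        score = F.cosine_similarity(words, context, dim=-1).mean(dim=1)  # (B,)
        return context, score


if __name__ == "__main__":
    B, T, N, D = 2, 16, 49, 256
    module = CrossAttentionAlignment(D)
    words = torch.randn(B, T, D)
    regions = torch.randn(B, N, D)
    context, score = module(words, regions)
    print(context.shape, score.shape)  # torch.Size([2, 16, 256]) torch.Size([2])
```

In such a design, the matching score could serve as the text-alignment term of the generator objective, while the style-retention and adversarial terms would be added separately; the exact formulation used by the authors is not specified in the abstract.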