Abstract: Text-to-image generation, i.e., generating an image from a text description, is a challenging task because of the significant semantic gap between the two domains. Humans, however, tackle this problem intelligently: we learn from diverse objects to form solid prior knowledge about semantics, textures, colors, shapes, and layouts; given a text description, we immediately imagine an overall visual impression from this prior knowledge; and, based on this impression, we draw a mental picture by progressively adding detail. Inspired by this process, we propose a novel text-to-image method, LeicaGAN, that combines these three phases (learning, imagining, creating) in a unified framework. First, we formulate the prior-knowledge learning phase as textual-visual co-embedding (TVE), comprising a text-image encoder that learns semantic, texture, and color priors and a text-mask encoder that learns shape and layout priors. Second, we formulate the imagination phase as prior knowledge aggregation (PKA), which combines these complementary priors and adds noise for diversity. Lastly, we formulate the creation phase with a cascaded attentive generator (CAG) that progressively draws the picture from coarse to fine. LeicaGAN is trained with adversarial learning to enforce semantic consistency and visual realism. Thorough experiments on two public benchmark datasets demonstrate LeicaGAN's superiority over state-of-the-art methods.
Code Link: https://github.com/qiaott/LeicaGAN
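The following is a minimal PyTorch-style sketch of the three-phase pipeline described in the abstract (TVE → PKA → CAG). All module names, dimensions, and wiring here are illustrative assumptions chosen for readability; they do not reflect the actual implementation in the linked repository, which also includes the image and mask encoders, word-level attention, and the discriminators omitted below.

```python
# Illustrative sketch only: names, dimensions, and wiring are assumptions based on
# the abstract, not the actual LeicaGAN codebase.
import torch
import torch.nn as nn

class TextVisualCoEmbedding(nn.Module):
    """Prior-knowledge phase (TVE): map a sentence into two complementary prior
    embeddings. In the paper these are trained jointly with an image encoder and
    a mask encoder; only the text side is sketched here."""
    def __init__(self, vocab_size=5000, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.to_semantic = nn.Linear(embed_dim, embed_dim)  # semantic/texture/color prior
        self.to_layout = nn.Linear(embed_dim, embed_dim)    # shape/layout prior

    def forward(self, tokens):
        _, h = self.rnn(self.embed(tokens))  # final hidden state as sentence feature
        h = h.squeeze(0)
        return self.to_semantic(h), self.to_layout(h)

class PriorKnowledgeAggregation(nn.Module):
    """Imagination phase (PKA): fuse the complementary priors, adding noise for diversity."""
    def __init__(self, embed_dim=256, noise_dim=100):
        super().__init__()
        self.noise_dim = noise_dim
        self.fuse = nn.Linear(2 * embed_dim + noise_dim, embed_dim)

    def forward(self, sem_prior, layout_prior):
        z = torch.randn(sem_prior.size(0), self.noise_dim, device=sem_prior.device)
        return self.fuse(torch.cat([sem_prior, layout_prior, z], dim=1))

class CascadedAttentiveGenerator(nn.Module):
    """Creation phase (CAG): draw the picture coarse-to-fine; attention is omitted here."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(embed_dim, 3 * 64 * 64), nn.Tanh())
        self.refine = nn.Sequential(
            nn.Upsample(scale_factor=4), nn.Conv2d(3, 3, kernel_size=3, padding=1), nn.Tanh())

    def forward(self, cond):
        coarse = self.stage1(cond).view(-1, 3, 64, 64)  # low-resolution draft
        fine = self.refine(coarse)                      # refined, higher-resolution image
        return coarse, fine

# Usage: tokens -> complementary priors -> aggregated condition -> coarse and fine images.
tokens = torch.randint(0, 5000, (4, 18))                # a batch of 4 token sequences
tve, pka, cag = TextVisualCoEmbedding(), PriorKnowledgeAggregation(), CascadedAttentiveGenerator()
coarse, fine = cag(pka(*tve(tokens)))
print(coarse.shape, fine.shape)                         # (4, 3, 64, 64) and (4, 3, 256, 256)
```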
CMT Num: 479