Abstract: Recent text-to-image generation methods that employ hundreds of millions to billions of model parameters, or that are trained using tens to hundreds of GPUs, have delivered highly compelling images that match their text descriptions. However, training a model to synthesize realistic images from text descriptions on a single GPU remains a challenging task. In this paper, we revisit the problem of generating images from text using models with fewer than 100 million parameters that can reliably be trained from scratch on a single-V100-GPU machine, and we point out that there are still significant gains to be made within this problem setting. The current state of the art among such low-resource models typically tackles text-to-image generation in a multi-stage manner, first generating a rough initial image and then refining image details at subsequent stages. However, current methods suffer from three important limitations. Firstly, initial images are generated at the sentence level and provide a poor basis for word-level refinement. Secondly, by using a common text representation across all image regions, current refinement methods prevent different interpretations of words at different regions. Finally, images are refined in a single shot at each stage, limiting the precision of image improvement. We introduce three novel components to address these shortcomings of low-resource methods: (1) a word-level initial stage to generate a better basis for refinement; (2) a spatial dynamic memory module to interpret words differently at different image regions; and (3) an iterative multi-headed mechanism to better refine image details at each stage. We combine our three components into a unified model and demonstrate favourable performance against the previous state of the art.
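To make the idea behind component (2) concrete, the sketch below shows one plausible way to let each image region weight the words of the description differently, via a region-conditioned attention over word embeddings. This is an illustrative assumption only, written in PyTorch with hypothetical names (e.g. SpatialWordAttention, query_proj), and is not the paper's actual spatial dynamic memory implementation.

```python
# Hypothetical sketch (not the authors' code): region-conditioned word attention,
# assuming image features of shape (B, C, H, W) and word features of shape (B, T, D).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialWordAttention(nn.Module):
    """Each spatial location attends over the word embeddings with its own query,
    so the same word can be weighted differently at different image regions."""
    def __init__(self, img_channels: int, word_dim: int, key_dim: int = 128):
        super().__init__()
        self.query_proj = nn.Conv2d(img_channels, key_dim, kernel_size=1)  # per-region queries
        self.key_proj = nn.Linear(word_dim, key_dim)                        # word keys
        self.value_proj = nn.Linear(word_dim, img_channels)                 # word values

    def forward(self, img_feats, word_feats, word_mask=None):
        # img_feats: (B, C, H, W); word_feats: (B, T, D); word_mask: (B, T), 1 for real tokens
        B, C, H, W = img_feats.shape
        q = self.query_proj(img_feats).flatten(2).transpose(1, 2)       # (B, H*W, K)
        k = self.key_proj(word_feats)                                    # (B, T, K)
        v = self.value_proj(word_feats)                                  # (B, T, C)
        logits = torch.bmm(q, k.transpose(1, 2)) / k.shape[-1] ** 0.5   # (B, H*W, T)
        if word_mask is not None:
            logits = logits.masked_fill(word_mask[:, None, :] == 0, float("-inf"))
        attn = F.softmax(logits, dim=-1)                                 # region-specific word weights
        ctx = torch.bmm(attn, v)                                         # (B, H*W, C)
        return ctx.transpose(1, 2).reshape(B, C, H, W)                   # back to a feature map

# Usage: fuse the region-specific text context with the image features before refinement.
img = torch.randn(2, 64, 16, 16)
words = torch.randn(2, 12, 256)
fused = img + SpatialWordAttention(64, 256)(img, words)
```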
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Matthew_Walter1
Submission Number: 661