Learning Dual Text Embeddings by Synthesising Images Conditioned on Text

TMLR Paper 2985 Authors

10 Jul 2024 (modified: 22 Nov 2024) · Rejected by TMLR · CC BY 4.0
Abstract: Text-to-Image (T2I) synthesis is a challenging task that requires modelling complex interactions between two modalities (i.e., text and image). A common framework adopted by recent state-of-the-art approaches is to bootstrap the learning process with pre-trained, image-aligned text embeddings. These text embeddings are typically obtained by training an independent network with a contrastive loss between text and image features. The downside of this scheme is that the embeddings are trained only to differentiate between instances, i.e., to capture distinctive features. They are therefore unaware of the two perspectives involved in text-to-image synthesis: the generation process, which must capture the intricate and complex variations of images, and the discrimination process, which must capture distinctive features; this may hinder their usage in generative modelling. To alleviate this downside, this paper explores a new direction that learns text embeddings end-to-end from the text-to-image synthesis task itself, taking both the generation and discrimination perspectives into account. Specifically, a novel text-embedding learning scheme called "Dual Text Embedding" (DTE) is presented, in which one part of the embedding is optimised to enhance the photo-realism of the generated images, while the other part seeks to capture text-to-image alignment. Through a comprehensive set of experiments on three text-to-image benchmark datasets (Oxford-102, Caltech-UCSD, and MS-COCO), models with dual text embeddings perform favourably in comparison with embeddings trained only to learn distinctive features.
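To make the dual-embedding idea concrete, the following is a minimal sketch, not the authors' implementation: a PyTorch-style text encoder with two heads on a shared sentence feature, one embedding intended for a realism-oriented (adversarial) objective and one for a contrastive text-image alignment objective. All module, head, and loss names (e.g., `DualTextEmbedding`, `realism_head`, `alignment_head`) are hypothetical and chosen only to mirror the two perspectives described in the abstract.

```python
# Illustrative sketch only: a text encoder producing two embeddings from one caption.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualTextEmbedding(nn.Module):
    def __init__(self, vocab_size=5000, word_dim=256, hidden_dim=256, embed_dim=128):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        # A bidirectional GRU pools the caption into a single sentence feature.
        self.rnn = nn.GRU(word_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Two heads on the shared feature: one per perspective (generation vs. discrimination).
        self.realism_head = nn.Linear(2 * hidden_dim, embed_dim)    # conditions the generator
        self.alignment_head = nn.Linear(2 * hidden_dim, embed_dim)  # matched against image features

    def forward(self, token_ids):
        words = self.word_embed(token_ids)      # (B, T, word_dim)
        _, h = self.rnn(words)                  # h: (2, B, hidden_dim) for a 1-layer BiGRU
        sent = torch.cat([h[0], h[1]], dim=-1)  # (B, 2 * hidden_dim)
        return self.realism_head(sent), self.alignment_head(sent)


def alignment_loss(text_emb, image_emb, temperature=0.1):
    """Symmetric InfoNCE-style contrastive loss between matching text/image pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    enc = DualTextEmbedding()
    tokens = torch.randint(0, 5000, (4, 16))   # a batch of 4 captions, 16 tokens each
    realism_emb, align_emb = enc(tokens)
    image_features = torch.randn(4, 128)       # placeholder image features
    print(realism_emb.shape, alignment_loss(align_emb, image_features).item())
```

In a full T2I pipeline along the lines the abstract describes, the realism embedding would condition the generator and receive gradients from the photo-realism objective, while the alignment embedding would be trained to match image features, so that the two parts of the embedding are shaped by the generation and discrimination perspectives respectively.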
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Mathieu_Salzmann1
Submission Number: 2985