Abstract: Recent text-to-image (T2I) synthesis models have demonstrated an intriguing ability to produce high-quality images from text prompts. However, current models still suffer from the text-image misalignment problem (e.g., attribute errors and relation mistakes) in compositional generation. Existing methods condition T2I models on grounding inputs to improve controllability but ignore the explicit supervision available from the layout conditions. To tackle this issue, we propose Grounded jOint lAyout aLignment (GOAL), an effective framework for T2I synthesis. Two novel modules, discriminative semantic alignment (DSAlign) and masked attention alignment (MAAlign), are incorporated into this framework to improve text-image alignment. DSAlign leverages region-wise discriminative tasks to ensure low-level semantic alignment, while MAAlign provides high-level attention alignment by guiding the model to focus on the target object. We also build GOAL2K, a dataset of 2,000 semantically accurate image-text pairs with layout annotations, for model fine-tuning. Comprehensive evaluations on T2I-CompBench, NSR-1K, and DrawBench demonstrate the superior generation performance of our method; in particular, we observe improvements of 19%, 13%, and 12% on the color, shape, and texture metrics of T2I-CompBench. Additionally, Q-Align metrics demonstrate that our method generates images of higher quality.
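To make the attention-side objective concrete, the snippet below is a minimal sketch, not the authors' implementation, of a masked attention alignment loss of the kind MAAlign describes. It assumes access to per-token cross-attention maps and binary layout masks for each grounded phrase; the function name, tensor shapes, and normalization choice are illustrative assumptions.

```python
# Illustrative sketch only (assumed shapes and names, not the paper's code):
# penalize cross-attention mass that falls outside each token's layout region.
import torch

def masked_attention_loss(attn_maps: torch.Tensor, layout_masks: torch.Tensor) -> torch.Tensor:
    """
    attn_maps:    (B, T, H, W) cross-attention maps for T grounded text tokens.
    layout_masks: (B, T, H, W) binary masks marking each token's target region.
    Returns a scalar loss encouraging attention to concentrate inside the region.
    """
    # Normalize each token's attention map to sum to 1 over the spatial grid.
    attn = attn_maps / (attn_maps.sum(dim=(-2, -1), keepdim=True) + 1e-8)
    # Fraction of attention mass inside the target region, per token.
    inside = (attn * layout_masks).sum(dim=(-2, -1))
    # Loss is the average mass leaking outside the region.
    return (1.0 - inside).mean()
```

In this formulation the loss is zero only when all attention mass for a grounded token lies inside its annotated region, which is one common way to couple layout supervision to cross-attention.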
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Vision and Language, [Experience] Multimedia Applications, [Content] Multimodal Fusion
Relevance To Conference: Our work contributes substantially to multimedia/multimodal processing by addressing the challenging text-image alignment problem in text-to-image synthesis models.
We propose Grounded jOint lAyout aLignment (GOAL), an effective fine-tuning framework that incorporates discriminative semantic alignment (DSAlign) and masked attention alignment (MAAlign) as auxiliary training objectives. These components not only improve the accuracy of text-image alignment but also enhance the visual quality of the generated images.
Furthermore, our work introduces GOAL2K, a curated multimodal dataset of over 2,000 annotated image-text pairs. This dataset serves as a valuable resource for training and fine-tuning text-to-image synthesis models, and its availability is expected to accelerate research progress in the field by providing researchers with standardized data for experimentation.
The comprehensive evaluation conducted across benchmark datasets demonstrates significant performance improvements, particularly on key metrics such as color, shape, and texture. Additionally, validation with Q-Align confirms the high quality of the images generated by the proposed method.
In summary, our contributions advance the state of the art in text-to-image generation and offer valuable insights for researchers in the multimedia/multimodal community seeking to develop more accurate, reliable, and visually appealing image synthesis models.
Supplementary Material: zip
Submission Number: 3233