Semantic Memory Guided Diffusion Networks for Image-to-Long Text Generation

18 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Diffusion Model, Semantic Guidance, Image-to-Long Text Generation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a diffusion-based solution and a new dataset for image-to-long text generation.
Abstract: Automatically describing an image with comprehensive textual content is demanded by many real-world applications, which motivates image-to-text generation tasks such as image captioning. However, conventional tasks mainly focus on generating short text and thus often fail in challenging scenarios where long text is inevitably required to describe enriched and diversified visual contents. Therefore, a more generic solution, one able to generate text of arbitrary length (long text in most cases), is expected to overcome the limitations of existing approaches, such as the inability to generate sufficiently comprehensive and complete textual content and to ensure its semantic coherence. To address these limitations, we propose a dedicated solution, semantic memory guided diffusion networks (SeMDiff), for image-to-long text generation (I2LTG), which explicitly captures salient semantics from the visual contents and further records and calibrates them with memory networks to facilitate the text generation process. Specifically, we employ semantic concepts as the vehicle to deliver and process the semantics embedded in images: concepts are predicted from each image, enhanced in memory, and then serve as the condition that guides the diffusion networks in iterative generation. Experimental results on three public datasets and a newly proposed one with more than 54K instances demonstrate the superiority of our approach over previous state-of-the-art solutions. Further analyses illustrate that our approach offers an effective diffusion-based solution with external guidance for long text generation under different cross-modal settings.
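The abstract's pipeline (predict concepts from the image, calibrate them against a memory bank, then use the pooled result to condition iterative denoising) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: all names (`predict_concepts`, `enhance_with_memory`, `guided_denoise_step`), the convex-blend memory calibration, and the simple mean-pooled guidance rule are assumptions for exposition, with random vectors standing in for learned encoders and networks.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 8, 16                              # embedding dim, concept vocabulary size
concept_embeds = rng.normal(size=(V, D))  # stand-in for learned concept embeddings
memory_bank = rng.normal(size=(V, D))     # stand-in for learned memory slots

def predict_concepts(image_feat, top_k=3):
    """Score every concept against the image feature; keep the top-k."""
    scores = concept_embeds @ image_feat
    return np.argsort(scores)[-top_k:]

def enhance_with_memory(concept_ids, alpha=0.5):
    """Calibrate each predicted concept by blending it with its memory slot."""
    c = concept_embeds[concept_ids]
    m = memory_bank[concept_ids]
    return (1 - alpha) * c + alpha * m

def guided_denoise_step(x_t, cond, step_scale=0.1):
    """One toy denoising step that nudges the latent toward the pooled condition."""
    guidance = cond.mean(axis=0)          # pool the enhanced concepts
    return x_t + step_scale * (guidance - x_t)

image_feat = rng.normal(size=D)           # stand-in for an encoded image
ids = predict_concepts(image_feat)
cond = enhance_with_memory(ids)
x = rng.normal(size=D)                    # noisy latent for the text sequence
for _ in range(50):                       # iterative, concept-conditioned refinement
    x = guided_denoise_step(x, cond)
```

Under this toy guidance rule the latent converges toward the pooled, memory-enhanced concept condition; in the actual model the denoiser would instead be a learned network that attends to the concepts while reconstructing text embeddings.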
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1068