Descriptive and Coherent Paragraph Generation for Image Paragraph Captioning Using Vision Transformer and Post-processing

Published: 01 Jan 2023, Last Modified: 20 May 2025, Venue: ACIVS 2023, License: CC BY-SA 4.0
Abstract: Visual paragraph generation is the task of producing a descriptive and coherent paragraph from an input image. Current state-of-the-art approaches rely on Region of Interest (RoI) identification to generate paragraphs. The proposed approach eliminates the need for RoI identification and instead uses a transformer-based encoder-decoder model for paragraph generation. A post-processing step enhances the semantic relevance of the generated paragraphs by incorporating image-text similarity scores and related-classes similarity scores. Our experiments show that the proposed model generates paragraphs with improved coherence and a higher Flesch reading ease score.
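For readers unfamiliar with the readability metric named in the abstract, the standard Flesch reading ease formula is 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words), where higher scores indicate easier text. The following is a minimal sketch of that formula, not the authors' evaluation code. The function names `count_syllables` and `flesch_reading_ease` are hypothetical, and the syllable counter is a rough vowel-group heuristic assumed here for illustration; a published evaluation would more likely use a pronunciation dictionary or an established readability library.

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables by counting vowel groups (heuristic assumption)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent unless the word ends in 'le' (e.g. "table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Compute the standard Flesch reading ease score for English text."""
    # Naive sentence split on terminal punctuation (assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )
```

Under this sketch, paragraphs with shorter sentences and shorter words score higher, which is the direction of improvement the abstract reports.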