Abstract: The Visual Storytelling Task (VST) goes beyond single-image description, as in image captioning, to describing a sequence of images as a coherent story. Such descriptions must handle varying language styles, relational role-modeling, consistency, and events not evident in any individual image. A common limitation of existing approaches is their inability to fully describe relations and visual changes across images, yielding stories that lack linguistic cohesion across sentences. To address this, we introduce the Sequential Image Storytelling Model (SISM), a novel framework based on the Transformer architecture. Our model contextualizes input images by dividing them into 16 × 16 patches and associates them with language content through an encoder-decoder design. It incorporates cross-attention and attention pooling to identify the most relevant relationships among the input images. Our model achieves state-of-the-art performance on the recently published SSID dataset and performs competitively on the VIST dataset, attaining the top score on the BLEU-1 metric.
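To make the architectural ingredients named in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' SISM implementation: it illustrates 16 × 16 patch embedding, a decoder layer with cross-attention from text tokens to image patches, and attention pooling of patch features. All module names, dimensions, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into non-overlapping 16x16 patches and embed each one."""
    def __init__(self, patch_size=16, in_ch=3, dim=512):
        super().__init__()
        # A strided conv is the standard way to do non-overlapping patch embedding.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)


class AttentionPool(nn.Module):
    """Pool a patch sequence into one summary vector via a learned query."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, N, dim)
        q = self.query.expand(x.size(0), -1, -1)
        out, _ = self.attn(q, x, x)            # (B, 1, dim)
        return out.squeeze(1)                  # (B, dim)


class StoryDecoderLayer(nn.Module):
    """Decoder layer: self-attention over text, cross-attention to image patches.
    Causal masking and positional encodings are omitted for brevity."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, txt, img):               # txt: (B, T, dim), img: (B, N, dim)
        t = self.n1(txt)
        txt = txt + self.self_attn(t, t, t)[0]
        t = self.n2(txt)
        txt = txt + self.cross_attn(t, img, img)[0]  # ground text in patch features
        return txt + self.ffn(self.n3(txt))


if __name__ == "__main__":
    B, T = 2, 12                                           # 2 images, 12 text tokens
    patches = PatchEmbed()(torch.randn(B, 3, 224, 224))    # (2, 196, 512)
    pooled = AttentionPool()(patches)                      # (2, 512) per-image summary
    fused = StoryDecoderLayer()(torch.randn(B, T, 512), patches)  # (2, 12, 512)
    print(pooled.shape, fused.shape)
```

In this sketch, per-image pooled vectors would let a model compare and relate the images in a sequence, while cross-attention grounds each generated sentence in the corresponding image's patches; how SISM actually combines these components is specified in the paper, not here.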