Multi-Granularity Feature Fusion for Image-Guided Story Ending Generation

Published: 01 Jan 2024, Last Modified: 14 Nov 2024. IEEE/ACM Trans. Audio Speech Lang. Process. 2024. License: CC BY-SA 4.0.
Abstract: Image-guided Story Ending Generation aims at generating a reasonable and logical ending given a story context and an ending-related image. Existing models have achieved some success by fusing global image features with the story context through an attention mechanism. However, they ignore the logical relationship between the story context and image regions, and do not consider high-level semantic features of the image, such as visual sentiment. This may make the generated ending inconsistent with the logic or sentiment of the given information. In this paper, we propose a Multi-Granularity feature Fusion (MGF) model to solve this problem. Concretely, we first employ an image sentiment extractor to grasp the sentiment features of the image as part of the global image features. We then design a scene subgraph selector to capture key-region image features by picking the scene subgraph most relevant to the context. Finally, we fuse the textual and visual features at the object, region, and global levels, respectively. Our model is thereby capable of effectively capturing the key region features and visual sentiment of the image, so as to generate a more logical and sentiment-consistent ending. Experimental results show that our MGF model outperforms the state-of-the-art models on most metrics.
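The multi-granularity fusion described above can be illustrated with a minimal sketch. This is not the authors' implementation: the feature dimensions, the single-query attention, and the concatenation-based fusion are all illustrative assumptions standing in for the paper's learned fusion modules (sentiment extractor, scene subgraph selector, and level-wise attention).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys):
    # scaled dot-product attention: one text query over a set of visual features
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = softmax(scores)
    return weights @ keys

def multi_granularity_fuse(context, obj_feats, region_feats, global_feat):
    """Fuse a text-context vector with visual features at three granularities.

    context:      (d,)   story-context representation (hypothetical encoder output)
    obj_feats:    (n, d) object-level features (e.g., detected objects)
    region_feats: (m, d) region-level features (e.g., the selected scene subgraph)
    global_feat:  (d,)   global image feature (would include sentiment in the paper)
    """
    obj_ctx = attend(context, obj_feats)        # object-level attended feature
    region_ctx = attend(context, region_feats)  # region-level attended feature
    # simple concatenation as a stand-in for the model's learned fusion
    return np.concatenate([context, obj_ctx, region_ctx, global_feat])

d = 8
rng = np.random.default_rng(0)
fused = multi_granularity_fuse(rng.normal(size=d),
                               rng.normal(size=(5, d)),
                               rng.normal(size=(3, d)),
                               rng.normal(size=d))
print(fused.shape)  # (32,)
```

In practice each level would pass through learned projections and the fused vector would condition a decoder that generates the story ending; the sketch only shows how the three granularities can be combined into one representation.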