Spatial-Temporal Contextual Feature Fusion Network for Movie Description

Published: 01 Jan 2022 · Last Modified: 13 Nov 2024 · CICAI (1) 2022 · CC BY-SA 4.0
Abstract: The movie description task aims to generate narrative textual descriptions that match the content of a movie. Most current methods cannot perform comprehensive visual content analysis and exploit contextual information at the same time, resulting in inaccurate or incoherent generated descriptions. To tackle this problem, we propose a new method called the spatial-temporal contextual feature fusion network (ST-CFFNet), which captures both spatial-temporal and contextual information in movies by building a stacked visual graph attention encoding unit and a contextual feature fusion module. We also propose a spatial-temporal context loss to constrain the effectiveness of ST-CFFNet in spatial-temporal relation analysis and context modeling. Experimental results on the LSMDC dataset show that our method generates more accurate and coherent movie descriptions.
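To make the two named components more concrete, the following is a minimal, hypothetical sketch of how a stacked visual graph attention encoding unit and a contextual feature fusion module could be realized in PyTorch. The abstract does not specify the internals, so the choice of scaled dot-product attention over region nodes, the gated fusion, and all class names, dimensions, and pooling steps are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VisualGraphAttentionUnit(nn.Module):
    """One graph-attention encoding unit over per-frame region features.

    Hypothetical sketch: regions of a clip are treated as graph nodes and
    attended over with scaled dot-product attention, one common way to
    realize a "visual graph attention" block.
    """
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, nodes):  # nodes: (batch, num_nodes, dim)
        q, k, v = self.q(nodes), self.k(nodes), self.v(nodes)
        attn = torch.softmax(q @ k.transpose(-2, -1) / nodes.size(-1) ** 0.5, dim=-1)
        return self.norm(nodes + attn @ v)  # residual connection + layer norm


class ContextualFeatureFusion(nn.Module):
    """Gated fusion of the current clip encoding with a context vector,
    e.g. pooled features of neighbouring clips or previous sentences."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, clip_feat, context_feat):  # both: (batch, dim)
        g = torch.sigmoid(self.gate(torch.cat([clip_feat, context_feat], dim=-1)))
        return g * clip_feat + (1 - g) * context_feat


if __name__ == "__main__":
    dim = 128
    # "Stacked" encoding: three attention units applied in sequence.
    encoder = nn.Sequential(*[VisualGraphAttentionUnit(dim) for _ in range(3)])
    fusion = ContextualFeatureFusion(dim)

    nodes = torch.randn(2, 5, dim)          # region features of the current clip
    context = torch.randn(2, dim)           # pooled features of surrounding clips
    clip_feat = encoder(nodes).mean(dim=1)  # encode and mean-pool over nodes
    fused = fusion(clip_feat, context)      # (2, 128) fused representation
    print(fused.shape)
```

In a full captioning pipeline, the fused representation would feed a language decoder, and the proposed spatial-temporal context loss would be added to the usual caption-generation loss; the abstract does not give its form, so it is omitted here.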