Abstract: The movie description task aims to generate narrative textual descriptions that match the content of a movie. Most current methods cannot simultaneously perform comprehensive visual content analysis and exploit contextual information, resulting in inaccurate or incoherent generated descriptions. To tackle this problem, we propose a new method called the spatial-temporal contextual feature fusion network (ST-CFFNet), which captures both spatial-temporal and contextual information in movies by building a stacked visual graph attention encoding unit and a contextual feature fusion module. We also propose a spatial-temporal context loss to constrain the effectiveness of ST-CFFNet in spatial-temporal relation analysis and context modeling. Experimental results on the LSMDC dataset show that our method produces more accurate and coherent movie descriptions.