Deep Relationship Analysis in Video with Multimodal Feature Fusion

Published: 01 Jan 2020, Last Modified: 09 May 2023 · ACM Multimedia 2020
Abstract: In this paper, we propose a novel multimodal feature fusion method based on scene segmentation to detect the relationships between entities in a long-duration video. Specifically, a long video is split into scenes, and the entities appearing in each scene are tracked. Text, audio, and visual features of a scene are extracted and fused to predict the relationships between the entities in that scene. The predicted relationships form a knowledge graph of the video, which can be used to answer queries about its content. Experimental results show that our method performs well for deep video understanding on the HLVU dataset.