A Hierarchical Deep Video Understanding Method with Shot-Based Instance Search and Large Language Model

Ruizhe Li, Jiahao Guo, Mingxi Li, Zhengqian Wu, Chao Liang

Published: 2023, Last Modified: 05 Mar 2025, ACM Multimedia 2023, CC BY-SA 4.0

Abstract: Deep video understanding (DVU) is often considered challenging because it aims to interpret a video with a storyline. The task poses problems at two levels: predicting human interactions at the scene level and identifying the relationship between two entities at the movie level. Based on our understanding of movie characteristics and our analysis of the DVU tasks, we propose a four-stage method consisting of video structuring, shot-based instance search, interaction and relation prediction, and shot-scene summary and question answering (QA) with ChatGPT. Among these stages, shot-based instance search allows accurate identification and tracking of characters at an appropriate video granularity. Using ChatGPT for QA narrows the answer space, and its strong text understanding ability helps answer the questions by supplying background knowledge. We rank first in movie-level group 2 and scene-level group 1, and second in movie-level group 1 and scene-level group 2, in the ACM MM 2023 Grand Challenge.