Abstract: In this paper, we propose a unified multi-modal retrieval framework to tackle two typical video understanding tasks: matching movie scenes with text descriptions, and scene sentiment classification. Matching movie scenes with text descriptions is naturally a multi-modal retrieval problem, and we cast scene sentiment classification as one as well: the framework retrieves the most relevant sentiment tag for each movie scene. Framing both tasks as multi-modal retrieval allows the proposed framework to make full use of models pre-trained on large-scale multi-modal datasets, which our experiments show is critical for tasks with only hundreds of training examples. To further improve performance on movie video understanding, we also collect a large-scale video-text dataset containing 427,603 movie-shot/text pairs. Experimental results validate the effectiveness of this dataset.
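To make the retrieval framing concrete, the following is a minimal sketch, not the authors' implementation, of sentiment classification cast as multi-modal retrieval: the scene and every candidate sentiment tag are embedded into a shared space, and the nearest tag is returned. The encoders, feature dimensions, and the `classify_scene` helper are hypothetical placeholders standing in for a pre-trained multi-modal backbone the abstract does not specify.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the pre-trained video and text encoders; the paper does
# not name a specific backbone, so these random projections are placeholders
# for encoders that map both modalities into one shared embedding space.
EMB_DIM = 512
video_proj = torch.nn.Linear(2048, EMB_DIM)  # pooled frame features -> shared space
text_proj = torch.nn.Linear(768, EMB_DIM)    # pooled token features -> shared space

def encode_video(frame_features: torch.Tensor) -> torch.Tensor:
    # frame_features: (num_frames, 2048); mean-pool over time, then project.
    return F.normalize(video_proj(frame_features.mean(dim=0)), dim=-1)

def encode_text(token_features: torch.Tensor) -> torch.Tensor:
    # token_features: (num_tokens, 768); mean-pool over tokens, then project.
    return F.normalize(text_proj(token_features.mean(dim=0)), dim=-1)

def classify_scene(frame_features, tag_features, tags):
    # Classification as retrieval: embed each candidate sentiment tag and
    # return the tag whose embedding is closest to the scene embedding.
    scene_emb = encode_video(frame_features)                        # (EMB_DIM,)
    tag_embs = torch.stack([encode_text(t) for t in tag_features])  # (num_tags, EMB_DIM)
    sims = tag_embs @ scene_emb                                     # cosine similarities
    return tags[sims.argmax().item()]

# Usage with dummy features for three candidate tags.
tags = ["joyful", "tense", "melancholic"]
frame_features = torch.randn(16, 2048)              # 16 sampled frames
tag_features = [torch.randn(4, 768) for _ in tags]  # 4 tokens per tag
print(classify_scene(frame_features, tag_features, tags))
```

Because the tag set is just another retrieval corpus under this framing, the same model and training procedure can serve both tasks without a task-specific classification head.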