Abstract: Video captioning is a challenging cross-modal task that requires taking full advantage of both vision and language. To identify objects in videos, object detectors are usually employed to extract high-level object-related features, yet the fine-grained knowledge within these detectors is often neglected. Moreover, object detection is not the only task capable of providing additional knowledge for video understanding. In this paper, multiple tasks are assigned to fully mine multi-concept knowledge across vision and language, including video-to-video, video-to-text, and text-to-text knowledge. Furthermore, since these kinds of knowledge are strongly synergistic, both global and local word similarities are derived from the text-to-text knowledge to improve the robustness of the mined semantic knowledge. The mined knowledge offers the model extra guidance, beyond the linguistic prior, for generating sentences that are more semantically appropriate and grammatically correct. Experimental results on the benchmark MSVD and MSR-VTT datasets show that the proposed method achieves remarkable improvements on all metrics on MSVD and on two out of four metrics on MSR-VTT.
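The abstract does not detail how the global and local word similarities are computed; the sketch below is only a minimal illustration of one plausible reading, in which a "global" similarity is the embedding-space similarity between two words regardless of context, while a "local" similarity is additionally gated by whether the two words co-occur in a given caption. The vocabulary, embeddings, and function names here are hypothetical placeholders, not the paper's actual formulation.

```python
import numpy as np

# Placeholder embeddings: random vectors stand in for pretrained word
# embeddings (e.g., GloVe would be a typical choice in practice).
rng = np.random.default_rng(0)
vocab = ["man", "woman", "dog", "running", "park", "ball"]
emb = {w: rng.normal(size=50) for w in vocab}


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def global_similarity(w1, w2):
    """Corpus-level ("global") word similarity: embedding cosine score,
    independent of any particular caption."""
    return cosine(emb[w1], emb[w2])


def local_similarity(w1, w2, caption_tokens):
    """Caption-level ("local") word similarity: non-zero only when both
    words actually appear in the given caption, so the score is grounded
    in the local linguistic context of that sentence."""
    if w1 in caption_tokens and w2 in caption_tokens:
        return global_similarity(w1, w2)
    return 0.0


caption = ["a", "man", "running", "in", "the", "park"]
print(global_similarity("man", "running"))   # context-free similarity
print(local_similarity("man", "dog", caption))  # 0.0: "dog" not in caption
```

Combining the two scores (for example, averaging them or using the local score to re-weight the global one) is one way such text-to-text knowledge could be fused, but the specific fusion used by the paper is not stated in the abstract.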