Visual and language semantic hybrid enhancement and complementary for video description

Published: 01 Jan 2022 · Last Modified: 11 Apr 2025 · Neural Comput. Appl. 2022 · CC BY-SA 4.0
Abstract: Describing the visual content of a video in natural language is a fundamental computer vision task: the description must not only summarize the video concisely, but also present the visual information in sentences with reasonable structure, correct grammar and well-chosen words. The task has wide potential applications in early education, visual aids, automatic interpretation and human–machine environment development. With the help of deep learning, a variety of effective video description models now exist. However, visual and language semantics are frequently mined in isolation, so the two sources of information cannot complement each other, and the accuracy and semantic quality of the generated sentences are difficult to improve further. To address this challenge, this work proposes a framework for video description with hybrid enhancement and complementarity of visual and language semantics. Specifically, the language- and visual-semantics enhancing branches are first integrated with a multimodal feature-based module. A multi-objective joint training strategy is then employed for model optimization. Finally, the output probabilities of the three branches are fused by weighted averaging for word prediction at each time step. Additionally, the language- and visual-semantics enhancing deep fusion modules are combined with the same joint training and sequential probability fusion for further performance improvement. Experimental results on the MSVD and MSR-VTT2016 datasets demonstrate the effectiveness of the proposed models: they outperform the baseline model Deep-Glove (denoted E-LSC for brevity) by a large margin and achieve performance competitive with state-of-the-art methods. In particular, BLEU4 and CIDEr reach 52.4 and 81.5, respectively, on MSVD with the proposed \(\hbox {HE-VLSC}^\#\) model.
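The abstract describes fusing the per-word output probabilities of the three branches (multimodal, visual-semantics enhancing, language-semantics enhancing) by weighted averaging at each decoding time step. Below is a minimal sketch of that fusion step; the function name and the weight values are illustrative assumptions, as the paper's actual weights are not given here.

```python
import numpy as np

def fuse_branch_probabilities(p_multimodal, p_visual, p_language,
                              weights=(0.4, 0.3, 0.3)):
    """Weighted-average fusion of vocabulary distributions from the
    multimodal, visual-semantics, and language-semantics branches.

    Each input is a probability distribution over the vocabulary for
    one decoding time step. The weights are hypothetical placeholders,
    not values taken from the paper.
    """
    w_m, w_v, w_l = weights
    fused = w_m * p_multimodal + w_v * p_visual + w_l * p_language
    return fused / fused.sum()  # renormalize against rounding drift

# Example: predict the next word at a single time step.
vocab_size = 5
rng = np.random.default_rng(0)
branch_probs = [rng.dirichlet(np.ones(vocab_size)) for _ in range(3)]
fused = fuse_branch_probabilities(*branch_probs)
next_word_id = int(np.argmax(fused))
```

In a full decoder this fusion would run once per time step, with the chosen word fed back as input to all branches for the next step.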