SRVC-LA: Sparse regularization of visual context and latent attention based model for video description

Published: 01 Jan 2025, Last Modified: 11 Apr 2025 · Neurocomputing 2025 · CC BY-SA 4.0
Abstract: Video description is an important generative task that summarizes visual content and translates it into natural language. An increasing number of effective models have been developed for this task. Nevertheless, in popular works the visual and language features are combined and represented in a dense multi-modal feature space, which makes the model prone to overfitting and leaves it with insufficient generalization ability. A model with sparse regularization of visual context and latent attention (SRVC-LA) is proposed in this work. A padding sequence corresponding to the encoded video is encoded and concatenated with the visual contextual features, yielding a visual regularization context that sparsifies the visual attention. This context is then processed by a latent attention mechanism, in which the visual regularization context and the previous hidden state are combined and attended for multi-modal semantic alignment. Additionally, the visual and language features are combined with their respective latent attention features and fed to two branches for semantic compensation. Experiments on the MSVD and MSR-VTT2016 datasets show better performance than the baseline and other popular models, demonstrating the effectiveness and superiority of the proposed model.
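The abstract does not give the model's equations, but the core step it describes — scoring a visual context against the previous decoder hidden state and attending over it — follows the familiar additive-attention pattern. Below is a minimal NumPy sketch of that pattern under assumed names and shapes (`visual_ctx`, `prev_hidden`, and the projection matrices are all illustrative, not taken from the paper):

```python
import numpy as np

def latent_attention(visual_ctx, prev_hidden, W_v, W_h, w_a):
    """Score each visual-context step against the previous hidden state,
    then return the attention-weighted summary of the context.
    visual_ctx:  (T, d_v) visual context features
    prev_hidden: (d_h,)   previous decoder hidden state
    """
    # Additive (Bahdanau-style) scores: w_a^T tanh(W_v v_t + W_h h)
    scores = np.tanh(visual_ctx @ W_v + prev_hidden @ W_h) @ w_a  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                      # softmax over T
    return weights @ visual_ctx, weights                          # (d_v,), (T,)

rng = np.random.default_rng(0)
T, d_v, d_h, d_a = 5, 8, 6, 4
ctx = rng.standard_normal((T, d_v))
h = rng.standard_normal(d_h)
attended, w = latent_attention(
    ctx, h,
    rng.standard_normal((d_v, d_a)),   # W_v: projects visual features
    rng.standard_normal((d_h, d_a)),   # W_h: projects hidden state
    rng.standard_normal(d_a),          # w_a: scoring vector
)
```

The paper's actual mechanism additionally builds the context from an encoded padding sequence and feeds the attended features into two semantic-compensation branches; those details are specific to SRVC-LA and are not reproduced here.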