Attention-enhanced joint learning network for micro-video venue classification

Published: 2024 · Last Modified: 07 Jan 2026 · Multimedia Tools and Applications, 2024 · License: CC BY-SA 4.0
Abstract: Micro-videos are currently a popular form of content on multimedia platforms. The venue information of micro-videos benefits venue-related applications such as personalized location recommendation and venue recognition. However, the performance of existing micro-video venue classification methods is limited because they ignore the global dependencies among features. To this end, an enhanced non-local (ENL) module is devised to improve the expressiveness of features. Furthermore, this paper proposes an attention-enhanced joint learning model that generates discriminative venue representations in an end-to-end manner. The unified model consists of normalized NeXtVLAD (NNeXtVLAD) modules, the ENL module, a CNN layer, and a context gate. Specifically, the sequential features extracted from multiple modalities are aggregated into compact vectors via parallel NNeXtVLAD modules. In the ENL module, the interactions between any two positions of the aggregated features are captured to reinforce the valuable information in each modality, and enhanced channel information is adaptively added for further feature enhancement. A CNN layer then fuses the enhanced features of the multiple modalities; an effective activation function is also explored in this layer to achieve better performance. Finally, the context gate method dynamically models the relationships between features and venue categories for prediction. Experimental results on a public dataset show that the proposed micro-video venue classification scheme achieves state-of-the-art performance.
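To make the two attention-style components in the abstract concrete, the following is a minimal NumPy sketch of (a) a non-local block that captures interactions between any two positions of an aggregated feature matrix (in the spirit of the embedded-Gaussian non-local operation), and (b) a sigmoid context gate. All weight matrices, dimensions, and function names here are illustrative assumptions, not the paper's actual ENL design or learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, rng):
    """Toy non-local block over x of shape (N, D).

    Pairwise affinities between all N positions reweight a value
    projection; a residual connection keeps the original features.
    The three random projections stand in for learned weights.
    """
    n, d = x.shape
    theta = x @ (rng.standard_normal((d, d)) * 0.1)  # query projection (assumed)
    phi = x @ (rng.standard_normal((d, d)) * 0.1)    # key projection (assumed)
    g = x @ (rng.standard_normal((d, d)) * 0.1)      # value projection (assumed)
    attn = softmax(theta @ phi.T, axis=-1)           # (N, N) position affinities
    return x + attn @ g                              # residual enhancement

def context_gate(x, rng):
    """Element-wise sigmoid gating of features, as in context gating."""
    d = x.shape[-1]
    w = rng.standard_normal((d, d)) * 0.1            # gate weights (assumed)
    return x * (1.0 / (1.0 + np.exp(-(x @ w))))      # sigmoid gate
```

A usage pass might chain the two, e.g. `context_gate(non_local_block(x, rng), rng)` on an `(N, D)` matrix of aggregated multimodal features; both operations preserve the input shape.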