Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description

CVPR Workshops 2018
Abstract: We incorporate audio features, in addition to image and motion features, for video description based on encoder-decoder recurrent neural networks (RNNs). To fuse these modalities, we introduce a multimodal attention model that can selectively attend to features from different modalities for each word in the output description. We apply our new framework to video description using state-of-the-art audio features such as SoundNet and AudioSet VGGish, and state-of-the-art image and spatiotemporal features such as I3D. Results confirm that our attention-based multimodal fusion of audio and visual features outperforms conventional video description approaches on three datasets.
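The per-word modality attention described in the abstract can be sketched roughly as follows: at each decoding step, the decoder's hidden state scores each modality's context vector (audio, image, motion), and a softmax over those scores yields fusion weights. The sketch below is a minimal PyTorch illustration, not the authors' implementation; the class name MultimodalAttentionFusion, the additive scoring form, and all layer dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn

class MultimodalAttentionFusion(nn.Module):
    """Hypothetical sketch: fuse per-modality context vectors with
    attention weights conditioned on the decoder hidden state, so each
    generated word can weight audio, image, and motion differently."""

    def __init__(self, modality_dims, hidden_dim, fused_dim, attn_dim=128):
        super().__init__()
        # Project each modality's context vector into a shared space.
        self.proj = nn.ModuleList(nn.Linear(d, fused_dim) for d in modality_dims)
        # Additive (Bahdanau-style) scoring of (hidden state, modality context).
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.w_ctx = nn.ModuleList(nn.Linear(d, attn_dim) for d in modality_dims)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, hidden, contexts):
        # hidden:   (batch, hidden_dim) decoder state at the current word
        # contexts: list of (batch, d_k) tensors, one per modality
        scores = [
            self.v(torch.tanh(self.w_hidden(hidden) + w(c)))  # (batch, 1)
            for w, c in zip(self.w_ctx, contexts)
        ]
        alpha = torch.softmax(torch.cat(scores, dim=1), dim=1)  # (batch, K)
        projected = torch.stack(
            [p(c) for p, c in zip(self.proj, contexts)], dim=1
        )  # (batch, K, fused_dim)
        # Weighted sum over modalities -> one fused vector per step.
        return (alpha.unsqueeze(-1) * projected).sum(dim=1)

if __name__ == "__main__":
    # Example dims: SoundNet-style audio (1024), I3D motion (2048), image (512).
    fusion = MultimodalAttentionFusion([1024, 2048, 512],
                                       hidden_dim=512, fused_dim=512)
    h = torch.randn(8, 512)
    ctxs = [torch.randn(8, 1024), torch.randn(8, 2048), torch.randn(8, 512)]
    fused = fusion(h, ctxs)  # (8, 512), fed back into the decoder RNN
```

Conditioning the fusion weights on the decoder state is what lets the model emphasize, say, audio features when generating a word like "music" and motion features for a word like "running".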