A Crossmodal Approach to Multimodal Fusion in Video Hyperlinking

07 Apr 2022 · OpenReview Archive Direct Upload
Abstract: With the recent resurgence of neural networks and the proliferation of massive amounts of unlabeled multimodal data, recommendation systems and multimodal retrieval systems based on continuous representation spaces and deep learning methods are attracting growing interest. In this work, we present a method for high-level multimodal fusion that focuses on crossmodal translation, using symmetrical encoders cast into a bidirectional deep neural network (BiDNN). We analyze different continuous single-modal representations and evaluate BiDNNs in a multimodal retrieval setup. Building on the insights gained from multimodal retrieval, we craft a BiDNN-based system that performs video hyperlinking and recommends interesting video segments to a viewer. Results established within the TRECVID 2016 video hyperlinking benchmarking initiative show that our method obtained the best score, thus defining the state of the art.
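
The abstract describes the architecture only at a high level: two symmetrical crossmodal encoders, one per translation direction, tied into a single bidirectional network whose central activations serve as the joint representation. The sketch below illustrates that general idea in PyTorch; the class name, layer sizes, tanh activations, and the exact weight-tying scheme are illustrative assumptions and not the authors' released implementation.

```python
# Minimal sketch of a BiDNN-style crossmodal translation model.
# All hyperparameters and the tying scheme are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiDNNSketch(nn.Module):
    """Two crossmodal translation paths (A -> B and B -> A) whose central
    layers share tied weights, yielding a common representation space."""
    def __init__(self, dim_a: int, dim_b: int, hidden: int = 1024):
        super().__init__()
        self.in_a = nn.Linear(dim_a, hidden)   # modality A -> hidden
        self.in_b = nn.Linear(dim_b, hidden)   # modality B -> hidden
        # Central weights shared by both directions (used as W and W^T).
        self.W = nn.Parameter(torch.randn(hidden, hidden) * 0.01)
        self.out_b = nn.Linear(hidden, dim_b)  # hidden -> modality B
        self.out_a = nn.Linear(hidden, dim_a)  # hidden -> modality A
        self.act = nn.Tanh()

    def forward(self, x_a, x_b):
        # A -> B path: central layer uses W.
        h_a = self.act(F.linear(self.act(self.in_a(x_a)), self.W))
        b_from_a = self.out_b(h_a)
        # B -> A path: central layer uses W transposed (symmetrical tying).
        h_b = self.act(F.linear(self.act(self.in_b(x_b)), self.W.t()))
        a_from_b = self.out_a(h_b)
        # Joint multimodal embedding: concatenation of the central activations.
        joint = torch.cat([h_a, h_b], dim=-1)
        return b_from_a, a_from_b, joint

# Hypothetical usage: train with a crossmodal reconstruction objective, then
# compare video segments via their joint embeddings (e.g. cosine similarity).
model = BiDNNSketch(dim_a=2048, dim_b=300)   # e.g. visual features vs. text embeddings
x_a, x_b = torch.randn(8, 2048), torch.randn(8, 300)
b_hat, a_hat, joint = model(x_a, x_b)
loss = F.mse_loss(b_hat, x_b) + F.mse_loss(a_hat, x_a)
```

In a hyperlinking setup of this kind, ranking candidate target segments would amount to scoring the similarity of their joint embeddings against the anchor segment's embedding; the details above are a sketch of that pipeline, not a description of the benchmarked system.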