Cross-modal Embeddings for Video and Audio Retrieval

ECCV Workshops 2018
Abstract: In this work, we explore the multi-modal information provided by the YouTube-8M dataset by projecting the audio and visual features into a common feature space, obtaining joint audio-visual embeddings. These joint embeddings are used to retrieve audio samples that fit a given silent video, and also to retrieve images that match a given query audio. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning.
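
The abstract does not detail the model, so the following is only a minimal sketch of the general idea it describes: two small projection networks map pre-extracted visual and audio features into a shared space, a ranking-style loss pulls matching pairs together, and retrieval quality is measured with Recall@K. The feature dimensions (1024-D visual, 128-D audio, as distributed with YouTube-8M), the network sizes, the margin loss, and all names below are illustrative assumptions, not the authors' implementation.

# Sketch (not the authors' code): joint audio-visual embeddings and Recall@K.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projection(nn.Module):
    """Maps a modality-specific feature vector into the joint embedding space."""
    def __init__(self, in_dim: int, joint_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, joint_dim),
            nn.ReLU(),
            nn.Linear(joint_dim, joint_dim),
        )

    def forward(self, x):
        # Unit-norm embeddings so dot products are cosine similarities.
        return F.normalize(self.net(x), dim=-1)

visual_proj = Projection(in_dim=1024)  # YouTube-8M visual features
audio_proj = Projection(in_dim=128)    # YouTube-8M audio features

def ranking_loss(v_emb, a_emb, margin: float = 0.2):
    """Hinge ranking loss: true video/audio pairs should score above mismatched ones."""
    sims = v_emb @ a_emb.t()               # pairwise similarities, (N, N)
    pos = sims.diag().unsqueeze(1)         # similarity of the true pairs
    mask = ~torch.eye(sims.size(0), dtype=torch.bool)  # ignore the diagonal
    loss_v2a = F.relu(margin + sims - pos)[mask].mean()    # video -> audio
    loss_a2v = F.relu(margin + sims.t() - pos)[mask].mean()  # audio -> video
    return loss_v2a + loss_a2v

@torch.no_grad()
def recall_at_k(query_emb, gallery_emb, k: int = 10):
    """Fraction of queries whose true match appears among the top-k retrieved items."""
    sims = query_emb @ gallery_emb.t()
    topk = sims.topk(k, dim=1).indices
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

if __name__ == "__main__":
    # Toy random features standing in for pre-extracted YouTube-8M clip features.
    visual_feats = torch.randn(64, 1024)
    audio_feats = torch.randn(64, 128)

    params = list(visual_proj.parameters()) + list(audio_proj.parameters())
    optim = torch.optim.Adam(params, lr=1e-3)
    for _ in range(100):
        loss = ranking_loss(visual_proj(visual_feats), audio_proj(audio_feats))
        optim.zero_grad()
        loss.backward()
        optim.step()

    v_emb, a_emb = visual_proj(visual_feats), audio_proj(audio_feats)
    print("video->audio Recall@10:", recall_at_k(v_emb, a_emb))
    print("audio->video Recall@10:", recall_at_k(a_emb, v_emb))

At evaluation time the gallery embeddings can be precomputed once, so each query reduces to a single matrix-vector similarity followed by a top-k lookup.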