Cross-modal Embeddings for Video and Audio Retrieval

ECCV Workshops 2018
Abstract: In this work, we explore the multi-modal information provided by the YouTube-8M dataset by projecting the audio and visual features into a common feature space, obtaining joint audio-visual embeddings. These joint embeddings are used to retrieve audio samples that fit a given silent video, and also to retrieve images that match a given query audio. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning.
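
The abstract does not detail the model, so the following is only a minimal sketch of the general idea it describes: two small projection networks map pre-extracted visual and audio features into a shared space, a ranking-style loss pulls matching pairs together, and retrieval quality is measured with Recall@K. The feature dimensions (1024-D visual, 128-D audio, as distributed with YouTube-8M), the network sizes, the margin loss, and all names below are illustrative assumptions, not the authors' implementation.

# Sketch (not the authors' code): joint audio-visual embeddings and Recall@K.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projection(nn.Module):
    """Maps a modality-specific feature vector into the joint embedding space."""
    def __init__(self, in_dim: int, joint_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, joint_dim),
            nn.ReLU(),
            nn.Linear(joint_dim, joint_dim),
        )

    def forward(self, x):
        # Unit-norm embeddings so dot products are cosine similarities.
        return F.normalize(self.net(x), dim=-1)

visual_proj = Projection(in_dim=1024)  # YouTube-8M visual features
audio_proj = Projection(in_dim=128)    # YouTube-8M audio features

def ranking_loss(v_emb, a_emb, margin: float = 0.2):
    """Hinge ranking loss: true video/audio pairs should score above mismatched ones."""
    sims = v_emb @ a_emb.t()               # pairwise similarities, (N, N)
    pos = sims.diag().unsqueeze(1)         # similarity of the true pairs
    mask = ~torch.eye(sims.size(0), dtype=torch.bool)  # ignore the diagonal
    loss_v2a = F.relu(margin + sims - pos)[mask].mean()    # video -> audio
    loss_a2v = F.relu(margin + sims.t() - pos)[mask].mean()  # audio -> video
    return loss_v2a + loss_a2v

@torch.no_grad()
def recall_at_k(query_emb, gallery_emb, k: int = 10):
    """Fraction of queries whose true match appears among the top-k retrieved items."""
    sims = query_emb @ gallery_emb.t()
    topk = sims.topk(k, dim=1).indices
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

if __name__ == "__main__":
    # Toy random features standing in for pre-extracted YouTube-8M clip features.
    visual_feats = torch.randn(64, 1024)
    audio_feats = torch.randn(64, 128)

    params = list(visual_proj.parameters()) + list(audio_proj.parameters())
    optim = torch.optim.Adam(params, lr=1e-3)
    for _ in range(100):
        loss = ranking_loss(visual_proj(visual_feats), audio_proj(audio_feats))
        optim.zero_grad()
        loss.backward()
        optim.step()

    v_emb, a_emb = visual_proj(visual_feats), audio_proj(audio_feats)
    print("video->audio Recall@10:", recall_at_k(v_emb, a_emb))
    print("audio->video Recall@10:", recall_at_k(a_emb, v_emb))

At evaluation time the gallery embeddings can be precomputed once, so each query reduces to a single matrix-vector similarity followed by a top-k lookup.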