Identify, Locate and Separate: Audio-Visual Object Extraction in Large Video Collections Using Weak Supervision

Sanjeel Parekh, Alexey Ozerov, Slim Essid, Ngoc Q. K. Duong, Patrick Pérez, Gaël Richard

Published: 2019, Last Modified: 15 Nov 2024. Venue: WASPAA 2019. License: CC BY-SA 4.0.

Abstract: We tackle the problem of audio-visual scene analysis for weakly-labeled data. To this end, we build upon our previous audio-visual representation learning framework to perform object classification in noisy acoustic environments and integrate audio source enhancement capability. This is made possible by a novel use of non-negative matrix factorization for the audio modality. Our approach is founded on the multiple instance learning paradigm. Its effectiveness is established through experiments over a challenging dataset of music instrument performance videos. We also show encouraging visual object localization results.
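As a rough illustration of the two ingredients the abstract names, the sketch below factorizes a non-negative spectrogram-like matrix with NMF and then aggregates per-component scores with a multiple-instance (max) rule. This is not the authors' implementation: the multiplicative-update NMF with a Euclidean cost, the random linear scorer, the max-pooling rule, the soft-mask enhancement step, and every name (`nmf`, `mil_max_score`, `scorer`, `keep`) are assumptions chosen only to make the idea concrete on toy data.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative V (freq x time) as W @ H using
    multiplicative updates for the Euclidean cost."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + eps   # spectral templates
    H = rng.random((rank, n_time)) + eps   # temporal activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def mil_max_score(instance_scores):
    """MIL aggregation: the bag (whole clip) score for each class is the
    maximum over its instances (here, NMF components)."""
    return instance_scores.max(axis=0)

# Toy stand-in for a magnitude spectrogram (assumption: random data).
rng = np.random.default_rng(1)
V = rng.random((64, 100))
W, H = nmf(V, rank=8)

# Hypothetical linear scorer applied to each component's activation vector.
n_classes = 3
scorer = rng.standard_normal((H.shape[1], n_classes))
instance_scores = H @ scorer               # shape: (components, classes)
bag_scores = mil_max_score(instance_scores)

# Enhancement sketch: keep the components that score highest for the
# predicted class and build a soft (Wiener-like) mask from them.
predicted = int(bag_scores.argmax())
keep = np.argsort(instance_scores[:, predicted])[-2:]
mask = (W[:, keep] @ H[keep]) / (W @ H + 1e-9)
enhanced = mask * V
```

In this toy setup each NMF component plays the role of an instance and the clip is the bag, so only a clip-level label would be needed to train the scorer; the same per-component scores then indicate which components to keep when reconstructing the enhanced signal.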