Identify, Locate and Separate: Audio-Visual Object Extraction in Large Video Collections Using Weak Supervision

Published: 01 Jan 2019, Last Modified: 15 Nov 2024, WASPAA 2019, CC BY-SA 4.0
Abstract: We tackle the problem of audio-visual scene analysis for weakly-labeled data. To this end, we build upon our previous audio-visual representation learning framework to perform object classification in noisy acoustic environments and integrate an audio source enhancement capability. This is made possible by a novel use of non-negative matrix factorization for the audio modality. Our approach is founded on the multiple instance learning paradigm. Its effectiveness is established through experiments on a challenging dataset of musical instrument performance videos. We also show encouraging visual object localization results.
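As a rough illustration of the building block the abstract names, the sketch below factorizes a non-negative magnitude spectrogram with standard multiplicative-update NMF and then recovers a source estimate through a soft mask over a chosen subset of components. It is not the authors' method: the function names `nmf` and `soft_mask_source`, the Euclidean cost, the rank, and the way components are selected are all assumptions made for illustration. In the paper's setting, the selection of relevant components would presumably come from the weakly supervised model rather than being given by hand.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix V (freq x frames) as W @ H.

    Uses the classic multiplicative updates for the Euclidean cost.
    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps   # spectral templates
    H = rng.random((rank, n_frames)) + eps  # temporal activations
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def soft_mask_source(V, W, H, component_ids, eps=1e-9):
    """Estimate one source by soft-masking V with a subset of components."""
    full = W @ H + eps
    part = W[:, component_ids] @ H[component_ids, :]
    return V * (part / full)


if __name__ == "__main__":
    # Synthetic rank-2 "spectrogram" standing in for real audio features.
    rng = np.random.default_rng(1)
    V = rng.random((6, 2)) @ rng.random((2, 8))
    W, H = nmf(V, rank=2)
    source = soft_mask_source(V, W, H, [0])
    print("relative error:", np.linalg.norm(V - W @ H) / np.linalg.norm(V))
    print("source estimate shape:", source.shape)
```

Because the masks for disjoint component subsets sum to approximately one, the corresponding source estimates add back up to the input spectrogram, which is a common reason for preferring soft masking over using the partial reconstruction directly.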