Abstract: Audio-Visual Event Localization (AVEL) is the task of temporally localizing and classifying \emph{audio-visual events}, i.e., events that are simultaneously visible and audible in a video. In this paper, we address AVEL in a weakly-supervised setting, where only video-level event labels (their presence/absence, but not their locations in time) are available as supervision for training. Our idea is to use a base model to estimate pseudo-labels on the training data at a finer temporal resolution than the video level (``label refinement'') and then re-train the model on these new labels. In label refinement, we estimate the subset of labels active in each \emph{slice} of frames of a training video by (i) replacing the frames outside the slice with frames from a second video that shares no video-level labels, and (ii) feeding this synthetic video into the base model to extract labels for just the slice in question. To handle the out-of-distribution nature of these synthetic videos, we propose an auxiliary training objective for the base model that induces more reliable predictions of the localized event labels. Our three-stage pipeline outperforms several existing AVEL methods without any architectural changes and also improves performance on a related weakly-supervised task. We further find that the standard evaluation of existing AVEL methods has been seriously misleading and therefore propose new metrics that give a better sense of performance.
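The label-refinement step described in the abstract can be pictured with a short sketch. The following is a minimal illustration (not the authors' code), assuming a PyTorch base model that maps segment-level features to video-level event probabilities; the names `BaseModel` and `refine_slice_labels`, the feature shapes, and the 0.5 threshold are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of the label-refinement idea, assuming segment-level features of
# shape [T, D] per video and a base model producing video-level probabilities.
import torch
import torch.nn as nn

NUM_CLASSES, T, D = 10, 8, 128  # assumed: event classes, segments per video, feature dim

class BaseModel(nn.Module):
    """Stand-in for a weakly-supervised AVEL base model (video-level predictions)."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(D, NUM_CLASSES)

    def forward(self, feats):                               # feats: [T, D]
        return torch.sigmoid(self.head(feats)).mean(dim=0)  # [NUM_CLASSES]

def refine_slice_labels(model, video, donor, video_labels, slice_idx, thresh=0.5):
    """Estimate which video-level labels are active within `slice_idx` of `video`.

    Frames outside the slice are replaced with frames from `donor`, a video whose
    video-level labels do not overlap with `video_labels`, so any event the model
    still detects in the synthetic video is attributed to the slice.
    """
    synthetic = donor.clone()
    synthetic[slice_idx] = video[slice_idx]  # keep only the slice from the original video
    with torch.no_grad():
        probs = model(synthetic)             # video-level probabilities on the synthetic video
    # Restrict to labels known to be present at the video level.
    slice_labels = (probs > thresh) & video_labels.bool()
    return slice_labels.float()

# Usage: refine labels for segments 2..4 of a training video.
model = BaseModel()
video, donor = torch.randn(T, D), torch.randn(T, D)
video_labels = torch.zeros(NUM_CLASSES)
video_labels[[1, 4]] = 1.0                   # video-level supervision (events 1 and 4 present)
print(refine_slice_labels(model, video, donor, video_labels, slice(2, 5)))
```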
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: Antoni B. Chan
Submission Number: 1716