Abstract: Multi-modal learning, particularly between imaging and linguistic modalities, has made remarkable strides in many high-level, fundamental visual understanding problems, ranging from language grounding to dense event captioning. However, much of this research has been limited to approaches that either do not take the audio accompanying the video into account at all, or that model audio-visual correlations only in service of sound or sound-source localization. In this paper, we present evidence that audio signals can carry a surprising amount of information for high-level visual-lingual tasks. Specifically, we focus on the problem of weakly-supervised dense event captioning in videos and show that audio on its own can nearly rival the performance of a state-of-the-art visual model and, combined with video, can improve on state-of-the-art performance. Extensive experiments on the ActivityNet Captions dataset show that our proposed multi-modal approach outperforms state-of-the-art unimodal methods, and also validate our specific feature representation and architecture design choices.