Dissecting Temporal Understanding in Text-to-Audio Retrieval

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Recent advancements in machine learning have fueled research on multimodal interactions, such as text-to-video and text-to-audio retrieval. These tasks require models to understand the semantic content of input videos, including objects, sounds, and characters, as well as their spatial arrangement and the temporal relationships between sounds. In this work, we tackle the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval. In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps dataset. Additionally, we introduce a synthetic text-audio dataset that provides a controlled setting for evaluating the temporal understanding of recent models. Lastly, we investigate a new loss function that encourages text-audio models to focus on the temporal ordering of events.
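The abstract mentions a loss function that encourages attention to event order without giving its form here. As one hedged illustration only, not the paper's actual formulation, a contrastive text-audio objective can be extended with temporally reordered captions as hard negatives; all names, shapes, and the hard-negative construction in the sketch below are assumptions.

```python
# Minimal sketch, assuming an InfoNCE-style text-audio contrastive setup.
# This is NOT the paper's loss; the reordered-caption hard negatives are illustrative only.
import torch
import torch.nn.functional as F


def temporal_contrastive_loss(audio_emb, text_emb, reordered_text_emb, temperature=0.07):
    """Contrastive text-audio loss with temporally reordered captions as hard negatives.

    audio_emb:          (B, D) audio clip embeddings
    text_emb:           (B, D) embeddings of the original captions
    reordered_text_emb: (B, D) embeddings of captions whose event order was swapped,
                        e.g. "a dog barks then a car passes" -> "a car passes then a dog barks"
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    reordered_text_emb = F.normalize(reordered_text_emb, dim=-1)

    # Standard in-batch audio-to-text similarities: (B, B)
    logits = audio_emb @ text_emb.t() / temperature

    # Similarity of each audio clip to its own temporally reordered caption: (B, 1).
    # Appended as an extra column, it acts as an additional hard negative class.
    hard_neg = (audio_emb * reordered_text_emb).sum(dim=-1, keepdim=True) / temperature
    logits_a2t = torch.cat([logits, hard_neg], dim=1)

    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Audio-to-text direction sees the hard negatives; text-to-audio stays in-batch symmetric.
    loss_a2t = F.cross_entropy(logits_a2t, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)


if __name__ == "__main__":
    B, D = 8, 512
    loss = temporal_contrastive_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```

The design intuition, under these assumptions, is that penalizing high similarity between an audio clip and an order-swapped caption forces the model to distinguish captions by event order rather than by bag-of-events content alone.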
Relevance To Conference: This work analyses semantic understanding across the text and audio modalities. More specifically, it investigates the task of free-form text-to-audio retrieval. It examines why current training data and benchmarks may not be sufficient for models to properly learn the temporal ordering of events, and it proposes a new multimodal synthetic dataset containing text descriptions and audio files. This matters in practice because better temporal understanding can improve search results given a user text query and a labelled/unlabelled database of audio files.
Supplementary Material: zip
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Content] Multimodal Fusion, [Experience] Multimedia Applications
Submission Number: 5636