TL;DR: An open-vocabulary sound event detection model trained with a logit-corrected frame-wise contrastive objective on a large-scale synthetic dataset.
Abstract: Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior work uses temporally-aware labels or unsupervised training to improve frame-wise capabilities, but still lacks the fine-grained labeling needed to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient, calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalance, during training. To enable frame-wise supervision, we leverage a large-scale dataset built from diverse audio events, LLM-generated captions, and simulation. Experimental results and case studies demonstrate that FLAM significantly improves open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.
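For illustration only, below is a minimal sketch of what a frame-wise contrastive objective with logit adjustment could look like; the tensor shapes, the binary cross-entropy form, and the prior-based adjustment term are assumptions made for exposition and are not taken from the paper.

```python
# Illustrative sketch of a frame-wise contrastive objective with logit
# adjustment. All shapes, names, and the exact loss form are assumptions
# for exposition; they are not the FLAM paper's actual objective.
import torch
import torch.nn.functional as F

def frame_wise_contrastive_loss(frame_emb, text_emb, frame_labels, class_prior,
                                temperature=0.07, adjustment_tau=1.0):
    """
    frame_emb:    (B, T, D)  per-frame audio embeddings
    text_emb:     (C, D)     embeddings of C candidate event descriptions
    frame_labels: (B, T, C)  binary frame-level targets (1 if event is active)
    class_prior:  (C,)       empirical frequency of each event in the data
    """
    # Cosine similarity between every frame and every event description.
    frame_emb = F.normalize(frame_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = torch.einsum("btd,cd->btc", frame_emb, text_emb) / temperature

    # Logit adjustment: subtract a log-prior term so frequent events do not
    # dominate the decision boundary (in the spirit of logit-adjusted losses).
    logits = logits - adjustment_tau * torch.log(class_prior + 1e-8)

    # Frame-wise binary objective over all (frame, event) pairs.
    return F.binary_cross_entropy_with_logits(logits, frame_labels.float())

# Toy usage with random tensors.
B, T, C, D = 2, 10, 5, 32
loss = frame_wise_contrastive_loss(
    torch.randn(B, T, D), torch.randn(C, D),
    torch.randint(0, 2, (B, T, C)), torch.full((C,), 1.0 / C))
```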
Lay Summary: Sound event detection, identifying which sounds occur in a recording and when they happen, can greatly enhance how we search, organize, and interact with audio data. However, existing systems are limited to predefined sound categories and struggle to accurately pinpoint the exact timing of diverse events.
Our research tackles this by introducing FLAM, an innovative system that matches audio frames directly to natural language descriptions, making it possible to detect any sound described by a user, even if it wasn't part of the training set. To overcome the scarcity of temporally annotated audio, we generated a large, diverse dataset by mixing short, labeled sound clips into varied background recordings, a form of data augmentation sketched below. We trained FLAM on this data using a specialized objective that corrects biases and enhances precision.
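As a rough illustration of this kind of simulation (not the paper's actual pipeline), the sketch below mixes a short foreground clip into a longer background at a random offset, which yields exact onset/offset labels for free; the function and parameter names are hypothetical.

```python
# Hypothetical sketch of simulating frame-labeled training mixtures: a short
# foreground event is placed into a longer background at a random offset.
# Names and parameters are illustrative assumptions, not the paper's pipeline.
import numpy as np

def mix_event_into_background(background, event, sr, snr_db=10.0, rng=None):
    """Return the mixture and the (onset, offset) of the event in seconds."""
    rng = rng or np.random.default_rng()
    start = int(rng.integers(0, len(background) - len(event)))

    # Scale the event to the requested signal-to-noise ratio vs. the background.
    bg_power = np.mean(background**2) + 1e-12
    ev_power = np.mean(event**2) + 1e-12
    gain = np.sqrt(bg_power / ev_power * 10 ** (snr_db / 10))

    mixture = background.copy()
    mixture[start:start + len(event)] += gain * event
    return mixture, (start / sr, (start + len(event)) / sr)

# Toy usage with synthetic audio in place of real recordings.
sr = 16000
bg = 0.01 * np.random.randn(10 * sr)                      # 10 s background noise
ev = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s tone as the "event"
mix, (onset, offset) = mix_event_into_background(bg, ev, sr)
print(f"event active from {onset:.2f}s to {offset:.2f}s")
```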
The result is a powerful model capable of accurately pinpointing when sound events occur in real time, significantly outperforming previous methods. FLAM opens new opportunities for audio applications such as smart search, accessibility tools, and multimedia content analysis, empowering users to interact intuitively with sound data.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Everything Else
Keywords: open-vocabulary sound event detection, sound event detection, audio language model, multimodal model, audio representation learning
Submission Number: 11128