Abstract: Contrastive audio-language models are trained by semantically aligning audio and text in a shared embedding space. Existing research shows that zero-shot classification performance is sensitive to language nuances and prompt formulation. In addition, learned artifacts and spurious correlations from noisy pretraining often lead to semantic ambiguity in label interpretation. While recent work has explored few-shot prefix-tuning methods, adapters, and prompt engineering strategies to mitigate these issues, the use of structured prior knowledge remains largely unexplored. In this work, we enhance CLAP predictions using structured reasoning over a knowledge graph (KG). We construct a large, audio-centric KG that encodes semantic, causal, and taxonomic relations reflective of everyday sound scenes and events. A systematic analysis of retrieval performance across major publicly available audio collections demonstrates that symbolic knowledge enables robust semantic grounding for contrastive audio-language models. This improvement is further supported by embedding visualizations of CLAP before and after incorporating the KG.
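The abstract does not specify how KG relations are combined with CLAP similarities; the sketch below is a minimal, hypothetical illustration of one plausible scheme (label-wise score propagation over a KG adjacency matrix among class labels), not the paper's actual method. All names, shapes, and the mixing weight alpha are assumptions made for illustration.

```python
import numpy as np

# Hypothetical inputs: one audio-clip embedding and C class-label text embeddings,
# both L2-normalised, as produced by any CLAP-style encoder.
rng = np.random.default_rng(0)
audio_emb = rng.standard_normal(512)
audio_emb /= np.linalg.norm(audio_emb)
text_embs = rng.standard_normal((10, 512))
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)

# Plain zero-shot CLAP scores: cosine similarity between the audio clip and each label prompt.
clap_scores = text_embs @ audio_emb                      # shape (C,)

# Hypothetical KG adjacency over the same label set: A[i, j] > 0 if labels i and j
# are linked by a semantic, causal, or taxonomic relation (e.g. "dog bark" -- "dog").
A = np.zeros((10, 10))
A[0, 1] = A[1, 0] = 1.0                                   # example edge

# One round of score propagation: each label borrows evidence from its KG neighbours.
alpha = 0.3                                               # mixing weight (assumed)
deg = A.sum(axis=1, keepdims=True).clip(min=1.0)          # avoid division by zero for isolated labels
kg_scores = (1 - alpha) * clap_scores + alpha * (A / deg) @ clap_scores

prediction = int(np.argmax(kg_scores))                    # KG-adjusted zero-shot prediction
```

Under this sketch, labels with no KG neighbours fall back to their raw CLAP score, while related labels reinforce one another; the actual paper may instead integrate the KG at the prompt or embedding level.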
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Information Extraction, Resources and Evaluation, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 6721