Abstract: Contrastive audio-language models are trained by semantically aligning audio and text in a shared embedding space. Existing research shows that zero-shot classification performance is sensitive to language nuances and prompt formulation. In addition, learned artifacts and spurious correlations from noisy pretraining often lead to semantic ambiguity in label interpretation. While recent work has explored few-shot prefix-tuning methods, adapters, and prompt engineering strategies to mitigate these issues, the use of structured prior knowledge remains largely unexplored. In this work, we enhance CLAP predictions using structured reasoning over a knowledge graph (KG). We construct a large, audio-centric KG that encodes semantic, causal, and taxonomic relations reflective of everyday sound scenes and events. A systematic analysis of retrieval performance across major publicly available audio collections demonstrates that symbolic knowledge enables robust semantic grounding for contrastive audio-language models. This improvement is further supported by embedding visualizations of CLAP before and after incorporating the KG.
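The abstract does not specify how KG relations are combined with CLAP similarities; the sketch below is a minimal, hypothetical illustration of one plausible scheme (label-wise score propagation over a KG adjacency matrix among class labels), not the paper's actual method. All names, shapes, and the mixing weight alpha are assumptions made for illustration.

```python
import numpy as np

# Hypothetical inputs: one audio-clip embedding and C class-label text embeddings,
# both L2-normalised, as produced by any CLAP-style encoder.
rng = np.random.default_rng(0)
audio_emb = rng.standard_normal(512)
audio_emb /= np.linalg.norm(audio_emb)
text_embs = rng.standard_normal((10, 512))
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)

# Plain zero-shot CLAP scores: cosine similarity between the audio clip and each label prompt.
clap_scores = text_embs @ audio_emb                      # shape (C,)

# Hypothetical KG adjacency over the same label set: A[i, j] > 0 if labels i and j
# are linked by a semantic, causal, or taxonomic relation (e.g. "dog bark" -- "dog").
A = np.zeros((10, 10))
A[0, 1] = A[1, 0] = 1.0                                   # example edge

# One round of score propagation: each label borrows evidence from its KG neighbours.
alpha = 0.3                                               # mixing weight (assumed)
deg = A.sum(axis=1, keepdims=True).clip(min=1.0)          # avoid division by zero for isolated labels
kg_scores = (1 - alpha) * clap_scores + alpha * (A / deg) @ clap_scores

prediction = int(np.argmax(kg_scores))                    # KG-adjusted zero-shot prediction
```

Under this sketch, labels with no KG neighbours fall back to their raw CLAP score, while related labels reinforce one another; the actual paper may instead integrate the KG at the prompt or embedding level.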
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Information Extraction, Resources and Evaluation, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 6721