Graph-Score: A Graph-grounded Metric for Audio Captioning

ACL ARR 2024 August Submission104 Authors

14 Aug 2024 (modified: 13 Mar 2025)ACL ARR 2024 August SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Evaluating audio captioning systems is a challenging problem since the evaluation process must consider numerous semantic alignments of candidate captions, such as sound event matching and the temporal relationship among them. The existing metrics failed to take these alignments into account as they consider either statistical overlap (BLEU, SPICE, CIDEr) or latent representation similarity (FENSE). To tackle the aforementioned issues of the current metrics, we propose the graph-score, which grounds audio captions to semantic graphs, for better measuring the performance of AAC systems. Our proposed metric achieves the highest agreement with human judgment on the pairwise benchmark datasets. Furthermore, we contribute high-quality benchmark datasets to make progress in developing evaluation metrics for the audio captioning task.
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Optimal transport, audio captioning, automated metric
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 104
Loading

OpenReview is a long-term project to advance science through improved peer review with legal nonprofit status. We gratefully acknowledge the support of the OpenReview Sponsors. © 2025 OpenReview