ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds

Published: 01 Jan 2023, Last Modified: 15 May 2025, Venue: EUSIPCO 2023, License: CC BY-SA 4.0
Abstract: Automated Audio Captioning is a multimodal task that aims to convert audio content into natural language. The performance of audio captioning systems is evaluated with quantitative metrics applied to the text representations. Previously, researchers have applied metrics from machine translation and image captioning to evaluate generated audio captions. Inspired by cognitive neuroscience research on auditory cognition, in this paper we present a novel metric that evaluates captions by taking into account how human listeners derive semantic information from sounds: Audio Captioning Evaluation on Semantics of Sound (ACES).