CLAIR-A: Leveraging Large Language Models to Judge Audio Captions

Published: 26 Aug 2025, Last Modified: 26 Aug 2025 · SpeechAI TTIC 2025 · CC BY 4.0
Keywords: Audio Captioning, Evaluation Metrics, Language Models, Auditory Scene Understanding
TL;DR: We validate that large language models can serve as judges for audio captioning, outperforming existing metrics in correlation with human preference.
Presentation Preference: Open to it if recommended by organizers
Abstract: Automated Audio Captioning (AAC) aims to generate natural language descriptions of audio. Evaluating these machine-generated captions is a complex task that demands an understanding of audio scenes, sound-object recognition, temporal coherence, and environmental context. While existing methods focus on a subset of these capabilities, they often fail to provide a comprehensive score that aligns with human judgment. Here, we introduce CLAIR-A, a simple and flexible approach that uses large language models (LLMs) in a zero-shot manner to produce a "semantic distance" score for candidate captions. In our experiments, CLAIR-A more closely matches human ratings than other metrics, outperforming the domain-specific FENSE metric by 5.8% and surpassing the best general-purpose measure by up to 11% on the Clotho-Eval dataset. Moreover, CLAIR-A allows the LLM to explain its scoring, and human evaluators rate these explanations up to 30% better than those from baseline methods.
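The zero-shot judging setup described in the abstract is straightforward to prototype. Below is a minimal sketch of an LLM-as-judge caption scorer in the spirit of CLAIR-A, assuming an OpenAI-style chat API; the model name, prompt wording, score scale, and JSON output format are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of an LLM-as-judge audio caption scorer (illustrative, not the
# authors' implementation). Assumes the OpenAI Python client and an API key in
# the OPENAI_API_KEY environment variable.
import json
from openai import OpenAI

client = OpenAI()

def score_caption(candidate: str, references: list[str]) -> dict:
    """Ask an LLM for a 0-100 semantic-similarity score plus a short rationale."""
    prompt = (
        "You are evaluating an automatically generated audio caption.\n"
        f"Candidate caption: {candidate}\n"
        f"Reference captions: {references}\n"
        "On a scale of 0 to 100, how semantically similar is the candidate to the "
        "references? Respond with JSON: {\"score\": <int>, \"reason\": <string>}."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable chat model could be used
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic scoring
    )
    return json.loads(response.choices[0].message.content)

# Example usage:
# result = score_caption("A dog barks while rain falls",
#                        ["Rain pours as a dog barks in the distance"])
# print(result["score"], result["reason"])
```

Returning both a score and a rationale in one structured response mirrors the paper's point that the judge can explain its scoring alongside the numeric rating.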
Submission Number: 34