Metrics for Fine-Grained Evaluation of Inline Audio Descriptions

Published: 28 Aug 2025, Last Modified: 29 Aug 2025, CV4A11y, CC BY 4.0
Keywords: Audio description, metrics, automated evaluation, VLMs
TL;DR: We propose fine-grained evaluation metrics for assessing the quality of audio descriptions and use them to compare VLM-generated with human-generated AD.
Abstract: Audio descriptions, which describe visual content in audio, make videos accessible to people who cannot see them. While humans create audio descriptions for film and television, audio descriptions remain largely unavailable for user-generated videos online. Publicly accessible Multimodal Large Language Models (MLLMs) can now create audio descriptions on demand. However, the quality of such descriptions remains unknown, especially for user-generated videos. We propose fine-grained evaluation metrics for assessing the quality of audio descriptions, derived from audio description expert guidelines and a user study. We collect expert human descriptions for 400 user-generated videos, and then use our proposed metrics to compare the quality of MLLM-generated and human-created descriptions on 60 videos. We synthesize the remaining gaps between human-created and MLLM-generated descriptions using our metrics and qualitative analysis.
Submission Number: 21