Automatic Audio Description: A Training-Free Approach Using Foundation Models

Ruxandra Tapu, Bogdan Mocanu

Published: 2025, Last Modified: 01 Mar 2026 | CAIP (2) 2025 | CC BY-SA 4.0

Abstract: In this paper, we propose a training-free framework for generating audio descriptions (ADs) by leveraging large pretrained Video-Language Models (VLMs) and Large Language Models (LLMs) without task-specific fine-tuning. Our method enhances video understanding through a semantic-constrained prompting strategy that incorporates temporally coherent context into VLM inputs, while an adaptive character recognition module ensures consistent identity tracking across frames. By explicitly linking visual character observations to narrative elements, the system produces contextually rich and coherent visual descriptions. Finally, the video captions are refined into a single, concise audio description sentence by an LLM operating exclusively on text inputs, ensuring clarity, brevity, and narrative cohesion.
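
The two-stage flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline under stated assumptions, not the authors' implementation: the names `Clip`, `build_vlm_prompt`, and `describe_clips`, the prompt wording, and the two-caption context window are all hypothetical, and the VLM and LLM are passed in as plain callables so the sketch runs without any model. Character names are assumed to come from a separate recognition step that is not modelled here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Clip:
    """A video segment (hypothetical structure, not taken from the paper)."""
    clip_id: int
    frames: List[str]  # placeholder frame identifiers
    # Names assumed to be supplied by a character recognition step.
    characters: List[str] = field(default_factory=list)


def build_vlm_prompt(clip: Clip, previous_captions: List[str],
                     max_context: int = 2) -> str:
    """Build a prompt that adds character names and recent captions as context."""
    context = previous_captions[-max_context:]
    lines = ["Describe what happens in this clip."]
    if clip.characters:
        lines.append("Characters visible: " + ", ".join(clip.characters) + ".")
    if context:
        lines.append("Earlier context: " + " ".join(context))
    return "\n".join(lines)


def describe_clips(clips: List[Clip],
                   vlm: Callable[[List[str], str], str],
                   llm: Callable[[str], str]) -> Dict[int, str]:
    """Caption each clip with the VLM, then condense each caption with a text-only LLM."""
    captions: List[str] = []
    descriptions: Dict[int, str] = {}
    for clip in clips:
        prompt = build_vlm_prompt(clip, captions)
        caption = vlm(clip.frames, prompt)  # visual stage
        captions.append(caption)
        # Text-only refinement stage: the LLM never sees the frames.
        descriptions[clip.clip_id] = llm(
            "Rewrite as one concise audio description sentence: " + caption
        )
    return descriptions
```

Any pair of callables with matching signatures can stand in for the models, which keeps the sketch independent of a particular VLM or LLM API.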

External IDs: dblp:conf/caip/TapuM25