Seeing Through Words: A Zero-Shot Multimodal Audio Description System with Foundation Models

Bogdan Mocanu, Ruxandra Tapu

Published: 2025 · Last Modified: 01 Mar 2026 · ISVC (2) 2025 · CC BY-SA 4.0
Abstract: Audio description (AD) plays a crucial role in making video content accessible to visually impaired audiences, yet current approaches often rely on expensive supervised training or struggle to capture temporal and narrative consistency. We introduce a training-free framework that integrates vision–language models (VLMs) with large language models (LLMs) through three complementary mechanisms: semantic-constrained prompting to reduce irrelevant content, adaptive character reasoning for accurate entity grounding, and a memory structure that aligns fine-grained shot-level cues with longer scene-level context. This design allows the system to generate temporally coherent and context-aware AD without requiring additional training data. Evaluation on the MAD-eval-Named and TV-AD benchmarks demonstrates consistent improvements over state-of-the-art training-free methods, with gains in both lexical and semantic quality metrics.
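The abstract's three mechanisms (semantic-constrained prompting, character grounding, and a shot/scene memory) can be pictured as a simple pipeline. The sketch below is an illustration only, not the authors' implementation: every function and class name here (`vlm_caption`, `llm_generate`, `SceneMemory`, `ground_characters`, `describe`) is a hypothetical stand-in, and the VLM/LLM calls are stubbed.

```python
from dataclasses import dataclass, field

# Hypothetical stubs for the foundation-model calls (assumptions, not the paper's API).
def vlm_caption(shot):
    # A real system would query a vision-language model per video shot.
    return f"caption for {shot}"

def llm_generate(prompt):
    # A real system would query a large language model with this prompt.
    return f"AD: {prompt[:60]}"

@dataclass
class SceneMemory:
    """Aligns fine-grained shot-level cues with longer scene-level context."""
    shot_cues: list = field(default_factory=list)
    scene_summary: str = ""

    def add_shot(self, cue):
        self.shot_cues.append(cue)
        # Scene-level context modeled as a rolling window over recent shot cues.
        self.scene_summary = " | ".join(self.shot_cues[-3:])

def ground_characters(cue, cast):
    # Adaptive character reasoning, reduced here to mapping
    # generic mentions ("the man") to named entities ("John").
    for generic, name in cast.items():
        cue = cue.replace(generic, name)
    return cue

def describe(shots, cast, banned_terms=("logo", "subtitle")):
    """Training-free AD generation: caption, ground, remember, then prompt."""
    memory = SceneMemory()
    descriptions = []
    for shot in shots:
        cue = ground_characters(vlm_caption(shot), cast)
        memory.add_shot(cue)
        # Semantic-constrained prompting: steer the LLM away from irrelevant content.
        prompt = (f"Context: {memory.scene_summary}. Describe: {cue}. "
                  f"Avoid: {', '.join(banned_terms)}.")
        descriptions.append(llm_generate(prompt))
    return descriptions

ads = describe(["shot1", "shot2"], {"the man": "John"})
```

The key design point the abstract emphasizes is that no component is trained: coherence comes from the memory structure feeding scene context back into each shot-level prompt.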