Abstract: Audio description (AD) plays a crucial role in making video content accessible to visually impaired audiences, yet current approaches often rely on expensive supervised training or struggle to maintain temporal and narrative consistency. We introduce a training-free framework that integrates vision–language models (VLMs) with large language models (LLMs) through three complementary mechanisms: semantic-constrained prompting to reduce irrelevant content, adaptive character reasoning for accurate entity grounding, and a memory structure that aligns fine-grained shot-level cues with longer scene-level context. This design allows the system to generate temporally coherent and context-aware AD without requiring additional training data. Evaluation on the MAD-eval-Named and TV-AD benchmarks demonstrates consistent improvements over state-of-the-art training-free methods, with gains in both lexical and semantic quality metrics.
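To make the three mechanisms concrete, the following is a minimal Python sketch of how such a training-free pipeline might be wired together. Every name here (MemoryBank, describe_shot, the prompt wording, the five-shot summarization interval) is an illustrative assumption, not the authors' implementation; the VLM captioner and LLM are treated as opaque callables.

```python
# Hypothetical sketch of a training-free AD pipeline combining a VLM
# captioner with an LLM writer. All names and prompt text are assumptions
# for illustration; the paper's actual design may differ.

from dataclasses import dataclass, field


@dataclass
class MemoryBank:
    """Two-level memory: fine-grained shot cues plus a running scene summary."""
    shot_cues: list[str] = field(default_factory=list)  # per-shot VLM captions
    scene_summary: str = ""                             # longer-horizon context

    def update(self, cue: str, summarize) -> None:
        self.shot_cues.append(cue)
        # Periodically compress shot-level cues into scene-level context
        # (interval of 5 is an arbitrary placeholder).
        if len(self.shot_cues) % 5 == 0:
            self.scene_summary = summarize(self.shot_cues, self.scene_summary)


def describe_shot(vlm_caption: str, characters: dict[str, str],
                  memory: MemoryBank, llm) -> str:
    """Compose one AD sentence for a shot, without any model training."""
    # Adaptive character reasoning: expose named-entity evidence so the LLM
    # can ground generic mentions ("a man") to known characters.
    char_context = "; ".join(f"{name}: {desc}" for name, desc in characters.items())

    # Semantic-constrained prompting: restrict the LLM to visible content
    # and suppress speculation or irrelevant detail.
    prompt = (
        "Write one concise audio-description sentence.\n"
        "Describe only what is visible; do not speculate.\n"
        f"Known characters: {char_context}\n"
        f"Scene context so far: {memory.scene_summary}\n"
        f"Current shot: {vlm_caption}"
    )
    return llm(prompt)
```

In this sketch, vlm_caption would come from any off-the-shelf VLM captioner and llm from any instruction-following LLM; since both are used zero-shot through prompting, the pipeline needs no additional training data, matching the abstract's training-free claim.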
External IDs: dblp:conf/isvc/MocanuT25