T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Published: 24 Mar 2026, Last Modified: 24 Mar 2026, CVPR 2026 Workshop VGBE, CC BY 4.0
Submission Type: Short Papers (up to 4 pages)
Keywords: Text-to-audio-video Generation Benchmark, Evaluation Benchmark
TL;DR: We introduce T2AV-Compass, a benchmark for evaluating text-to-audio-video generation, and find that 15 leading systems still fall short, especially on audio realism, synchronization, and instruction following.
Abstract: Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language. However, its evaluation remains fragmented, often relying on unimodal metrics or narrow benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems. It consists of 500 diverse, complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. T2AV-Compass further introduces a dual-level evaluation framework that combines objective signal-level metrics with a subjective, MLLM-based protocol for instruction following and realism assessment. Extensive evaluation of 15 representative T2AV systems shows that even the strongest models still fall substantially short of human-level cross-modal consistency, with persistent failures in audio realism and fine-grained synchronization. These results position T2AV-Compass as a challenging diagnostic testbed for advancing multimodal generation.
Submission Number: 2