Keywords: multi-agent, quality perception, multimodal large language model
Abstract: In this paper, we propose **XGC-AVis**, a multi-agent framework that enhances the audio-video temporal alignment capabilities of multimodal large language models (MLLMs) and improves the efficiency of retrieving key video segments through $4$ stages: perception, planning, execution, and reflection. We further introduce **XGC-AVQuiz**, the first benchmark aimed at comprehensively assessing MLLMs' understanding capabilities in both real-world and AI-generated scenarios. XGC-AVQuiz consists of $2,685$ question-answer pairs across $20$ tasks, with two key innovations: 1) **AIGC Scenario Expansion:** The benchmark includes $2,232$ videos, comprising $1,102$ professionally generated content (PGC), $753$ user-generated content (UGC), and $377$ AI-generated content (AIGC) videos, covering $10$ major domains and $53$ fine-grained categories. 2) **Quality Perception Dimension:** Beyond conventional tasks such as recognition, localization, and reasoning, we introduce a novel quality perception dimension, which requires MLLMs to integrate low-level sensory capabilities with high-level semantic understanding to assess audio-visual quality, synchronization, and coherence. Experimental results on XGC-AVQuiz demonstrate that current MLLMs struggle with quality perception and temporal alignment tasks. XGC-AVis improves these capabilities without requiring additional training, as validated on two benchmarks. The project page is available at: https://xgc-avis.github.io/XGC-AVis/
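To make the abstract's four-stage pipeline (perception, planning, execution, reflection) concrete, here is a minimal, hypothetical sketch of such an agent loop. It is not the authors' implementation: the names (`AVisAgent`, `SegmentCandidate`, the `mllm` callable, and the `"start-end: reason"` plan format) are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass
class SegmentCandidate:
    start: float          # segment start time (seconds)
    end: float            # segment end time (seconds)
    rationale: str = ""   # model-provided justification


class AVisAgent:
    """Toy perception -> planning -> execution -> reflection loop (illustrative only)."""

    def __init__(self, mllm, max_rounds: int = 3):
        self.mllm = mllm              # any text-in/text-out MLLM callable (assumed interface)
        self.max_rounds = max_rounds  # reflection retries before returning the last candidates

    def run(self, video_path: str, query: str) -> list[SegmentCandidate]:
        # 1) Perception: summarize the audio and visual events in the video.
        context = self.mllm(f"Describe the audio and visual events in {video_path}.")

        candidates: list[SegmentCandidate] = []
        for _ in range(self.max_rounds):
            # 2) Planning: propose time segments that may answer the query.
            plan = self.mllm(
                f"Given this description:\n{context}\n"
                f"Propose time segments (format 'start-end: reason') relevant to: {query}"
            )
            # 3) Execution: parse the plan into concrete segment candidates.
            candidates = self._parse_segments(plan)
            # 4) Reflection: ask the model to verify audio-video alignment; stop if consistent.
            verdict = self.mllm(
                f"Check whether these segments {candidates} are temporally aligned "
                f"with the query '{query}'. Answer ALIGNED or RETRY."
            )
            if "ALIGNED" in verdict:
                break
        return candidates

    @staticmethod
    def _parse_segments(plan: str) -> list[SegmentCandidate]:
        # Placeholder parser: expects lines like "12.0-18.5: reason".
        out = []
        for line in plan.splitlines():
            if "-" in line and ":" in line:
                span, _, why = line.partition(":")
                start, _, end = span.partition("-")
                try:
                    out.append(SegmentCandidate(float(start), float(end), why.strip()))
                except ValueError:
                    continue
        return out
```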
Primary Area: datasets and benchmarks
Submission Number: 18017