Keywords: Audio-Visual Dialogue; Audio Dialogue; Audio-Visual Understanding
Abstract: Advancing Multi-Modal Large Language Models (MLLMs) from "tool-oriented" auxiliary AI to "partner-oriented" interactive AI requires the capacity to act as an active participant in Audio-Visual Dialogue (AVD), a role that combines high **Visually-Grounded IQ** (the capacity to reason over joint audio-visual information) with high **Interactive EQ** (the ability to respond with empathy and expressiveness). However, progress in this area faces a key obstacle: the absence of a unified standard for defining and evaluating a "good" AVD model. The current evaluation landscape is fragmented: audio dialogue benchmarks assess Interactive EQ but remain visually blind, while audio-visual understanding benchmarks evaluate Visually-Grounded IQ but lack interactivity. This methodological gap leaves the critical synthesis of IQ and EQ in conversational contexts unmeasured. To address this gap, we introduce **Partner-Bench**, the first benchmark designed to evaluate this synthesis. To construct it, we present a novel data engine, **Partner-DE**, which automatically mines, filters, and annotates high-quality conversational data from web videos. Partner-Bench comprises 376 samples (3.49 hours total) and features a fine-grained, 7-dimensional evaluation framework: it decomposes IQ into three dimensions (Recognition, Comprehension, and Reasoning) and decouples EQ into two quality categories (Linguistic, covering Persona and Cohesion; and Prosodic, covering Naturalness and Affect). Our initial experiments on Partner-Bench yield three critical findings: (1) all current models perform well below the human baseline, indicating substantial room for improvement; (2) a clear performance gap exists between paradigms, with current SOTA cascaded models substantially outperforming existing end-to-end models (e.g., cascaded Mimo-Audio at 68.38 vs. Qwen2.5-Omni at 51.53); and (3) a "context cliff" emerges, where model performance initially improves with longer context but then sharply declines, revealing a failure to process extended interactions. By providing a rigorous standard and a diagnostic tool to pinpoint such weaknesses, Partner-Bench aims to guide the improvement of AVD models and ultimately accelerate the development of the next generation of truly perceptive and engaging AI companions.
Primary Area: datasets and benchmarks
Submission Number: 2737