Abstract: This work introduces Dialogue-AV, a benchmark dataset for Audio-Video-Language (AVL) learning. We propose using dialogue, rather than single captions, to describe video content, capturing nuances and meanings shared between audio and visual elements. This approach significantly improves the diversity of video descriptions and enables comprehensive evaluation of AVL learning across downstream tasks such as Cross-Modal Retrieval, Visual Question-Answering, and Video Captioning. Our dataset comprises approximately 258k audiovisual samples accompanied by dialogue-based descriptions for benchmarking. Dialogue-AV builds upon existing State-of-the-Art (SOTA) datasets featuring human-generated descriptions, enhancing them with model-generated descriptions that cover all modalities. We also present zero-shot baseline results utilising SOTA Visual-Language Models (VLMs), demonstrating that Dialogue-AV can benchmark a variety of downstream tasks with diverse inputs. Our key contributions are: 1) Dialogue-AV, a benchmark dataset for dialogue-based AVL models; and 2) benchmarks that expose the limitations of current SOTA VLMs. The code and dataset are available at: github.com/lvilaca16/dialogue-av.
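For context, a minimal sketch of what a zero-shot cross-modal retrieval baseline over dialogue-style descriptions might look like, using an off-the-shelf CLIP model from Hugging Face. The file names, dialogue strings, and the one-keyframe-per-video simplification are illustrative assumptions, not Dialogue-AV's actual schema or evaluation protocol.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical samples: each video reduced to a single keyframe, paired with a
# dialogue-based description (paths and texts are placeholders, not from Dialogue-AV).
frames = [Image.open(p) for p in ["video_0001.jpg", "video_0002.jpg"]]
dialogues = [
    "A: A dog barks at a passing car. B: You can hear the engine fade out.",
    "A: A pianist plays on stage. B: The audience applauds at the end.",
]

inputs = processor(
    text=dialogues, images=frames, return_tensors="pt", padding=True, truncation=True
)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image is a (num_frames, num_texts) similarity matrix; the argmax of
# each row is the retrieved description for that video's keyframe.
pred = out.logits_per_image.argmax(dim=-1)
recall_at_1 = (pred == torch.arange(len(frames))).float().mean().item()
print(f"zero-shot video-to-text R@1: {recall_at_1:.2f}")
```

A full baseline would aggregate multiple frames per video and score the audio track separately; this sketch only illustrates the text-visual matching step.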
External IDs: dblp:conf/cbmi/VilacaVY25