OmniDialog: A Multimodal Benchmark for Generalization Across Text, Visual, and Audio Modalities

Published: 28 Nov 2024, Last Modified: 04 Mar 2025 | The 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP | Everyone | CC BY 4.0
Abstract: We introduce *OmniDialog*, the first comprehensive trimodal benchmark grounded in a knowledge graph (Wikidata) for evaluating how well Large Multimodal Models (LMMs) generalize across three modalities: text, visual, and audio. The benchmark consists of more than 4,000 dialogues, averaging 10 turns each, all annotated and cross-validated by human experts. To prevent shortcut learning, the dialogues incorporate varied formats and misleading or irrelevant multimodal cues. We also evaluate both multimodal and unimodal models to gain insight into how they process the modality inputs introduced during a conversation.