UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in OmniModels

ACL ARR 2026 January Submission3555 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: Multimodal Large Language Models, Omni-modal Evaluation, Benchmark
Abstract: Multimodal Large Language Models have been progressing from uni-modal understanding toward unifying the visual, audio, and language modalities, collectively termed omni models. However, the correlation between uni-modal and omni-modal capabilities remains unclear, and comprehensive evaluation is required to drive the intelligence evolution of omni models. In this work, we introduce UNO-Bench, a novel, high-quality, and UNified Omni model benchmark. It is designed to effectively evaluate both UNi-modal and Omni-modal capabilities under a unified ability taxonomy spanning 44 task types and 5 modality combinations. For omni-modal evaluation, we provide 1,250 human-curated samples with 98% cross-modality solvability, well suited to real-world scenarios, particularly within the Chinese context. For uni-modal evaluation, we construct a dataset of 2,480 samples automatically distilled from 18 public benchmarks; this compressed dataset reduces evaluation costs by 90% while maintaining 98% consistency with the full-scale benchmarks. In addition to traditional multiple-choice questions, we propose an innovative multi-step open-ended question format to assess complex reasoning. A general scoring model is incorporated, supporting 6 question types for automated evaluation with 95% accuracy. Experimental results reveal a Compositional Law between omni-modal and uni-modal performance: omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models. Our code and data are available on GitHub.
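For intuition about what fitting such a Compositional Law might look like, below is a minimal, hypothetical sketch in Python. The abstract does not specify the law's functional form, so the multiplicative power-law composition, the toy scores, and all variable names here are illustrative assumptions rather than the paper's actual formulation.

```python
# Hypothetical sketch: relating uni-modal scores to an omni-modal score.
# The functional form and the toy data below are assumptions, not the
# paper's actual Compositional Law.
import numpy as np
from scipy.optimize import curve_fit

# Toy per-model scores in [0, 1]: uni-modal (vision, audio) and omni-modal.
vision = np.array([0.45, 0.60, 0.72, 0.81, 0.90])
audio  = np.array([0.40, 0.55, 0.70, 0.78, 0.88])
omni   = np.array([0.15, 0.30, 0.52, 0.68, 0.85])

def compositional(x, alpha, beta):
    """Assumed power-law composition: omni ≈ alpha * (vision * audio) ** beta."""
    v, a = x
    return alpha * (v * a) ** beta

params, _ = curve_fit(compositional, (vision, audio), omni, p0=[1.0, 1.0])
alpha, beta = params
print(f"fitted alpha={alpha:.2f}, beta={beta:.2f}")
# Under this assumed form, beta well above 1 would correspond to a bottleneck
# regime (omni performance lags the uni-modal product on weak models), while
# beta near 1 with larger alpha would suggest synergistic promotion on strong models.
```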
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, speech and vision, benchmarking, evaluation methodologies, vision question answering
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, Chinese
Submission Number: 3555