Keywords: Omni-models, MLLM, Foundation Model, LLM, OLM
TL;DR: A challenging tri-modal reasoning benchmark for evaluating omni-language models.
Abstract: Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to simultaneously process and reason across different inputs remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate models’ ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all modalities. Our evaluation reveals that: i) open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts; and ii) most baseline models perform poorly (below 50% accuracy) even when given textual alternatives to the image/audio inputs. To address these limitations, we develop OmniInstruct, a 96K-sample instruction tuning dataset for training OLMs. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance. Code and data can be found at https://m-a-p.ai/OmniBench/.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/m-a-p/OmniBench
Code URL: https://github.com/multimodal-art-projection/OmniBench
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1648
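For readers who want to inspect the benchmark, the minimal sketch below shows one way the dataset listed above might be loaded with the Hugging Face `datasets` library. It assumes the default configuration loads without an explicit config name; the printed split and schema inspection are illustrative and not taken from the paper or repository.

```python
# Minimal sketch (not from the paper): load OmniBench from the Hugging Face Hub.
# Assumes the default dataset configuration; split names and fields are whatever
# the repository actually provides, inspected here rather than hard-coded.
from datasets import load_dataset

# Pull the dataset referenced in the Dataset URL field above.
ds = load_dataset("m-a-p/OmniBench")

# Show available splits and the schema of each, to see which visual/audio/text
# fields the tri-modal questions expose.
print(ds)
for split_name, split in ds.items():
    print(split_name, split.features)
```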