OmniEarth-Bench: Probing Cognitive Abilities of MLLMs for Earth's Multi-sphere Observation Data

17 Sept 2025 (modified: 26 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Earth spheres, Observation Earth Data, MLLM, Benchmark
Abstract: Existing benchmarks for multimodal learning in Earth science offer limited, siloed coverage of Earth's spheres and their cross-sphere interactions, typically restricting evaluation to the human-activities sphere or the atmosphere and to at most 16 tasks. Holistically evaluating MLLMs on observational data across all Earth spheres faces three challenges: integrating multi-source heterogeneous data, formulating scientifically grounded tasks, and assessing cross-sphere reasoning. We therefore introduce OmniEarth-Bench, the first multimodal benchmark that systematically spans all six spheres (atmosphere, lithosphere, oceansphere, cryosphere, biosphere, and human-activities sphere) as well as cross-sphere interactions. Built with a scalable, modular pipeline that ingests 33 native Earth-observation sources and applies expert-in-the-loop curation, OmniEarth-Bench provides 29,855 standardized, expert-curated annotations. All annotations are organized into a four-level hierarchy (Sphere, Scenario, Ability, Task), encompassing 109 expert-curated evaluation tasks. Experiments on 9 state-of-the-art MLLMs show that even the most advanced models struggle on our benchmark: none reaches 35% accuracy, exposing systematic gaps in Earth-system cognitive ability. The dataset and evaluation code are publicly available.
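To make the four-level hierarchy concrete, here is a minimal Python sketch of one way an annotation record and a multiple-choice accuracy metric could be represented; all field names (sphere, scenario, ability, task, question, choices, answer) and the accuracy helper are illustrative assumptions, not the benchmark's released schema or evaluation code.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    # Four-level hierarchy from the abstract: Sphere -> Scenario -> Ability -> Task.
    sphere: str        # one of the six spheres, or "cross-sphere"
    scenario: str      # observation scenario within that sphere
    ability: str       # cognitive ability being probed
    task: str          # one of the 109 expert-curated evaluation tasks
    # Hypothetical multiple-choice payload shown to the MLLM.
    question: str
    choices: list[str]
    answer: str        # ground-truth choice

def accuracy(predictions: list[str], annotations: list[Annotation]) -> float:
    """Fraction of predictions matching the ground-truth answers."""
    if not annotations:
        return 0.0
    correct = sum(p == a.answer for p, a in zip(predictions, annotations))
    return correct / len(annotations)
```

Under this sketch, per-level results (e.g., accuracy per sphere or per ability) would follow by grouping annotations on the corresponding field before calling accuracy.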
Primary Area: datasets and benchmarks
Submission Number: 8406