OmniEarth-Bench: Towards Holistic Evaluation of Earth’s Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data
Keywords: Earth spheres, Observational Earth Data, MLLM, Benchmark
TL;DR: The first comprehensive MLLM benchmark spanning all six Earth science spheres (atmosphere, lithosphere, oceansphere, cryosphere, biosphere, and human-activities sphere) and cross-sphere interactions, with one hundred expert-curated evaluation dimensions.
Abstract: Existing benchmarks for multimodal learning in Earth science fall short in systematically covering geosystem components and cross-sphere interactions: they are often confined to isolated subsystems (e.g., the human-activities sphere or the atmosphere alone) and offer limited evaluation dimensions ($\leq$ 16 tasks).
To address these gaps, we introduce **OmniEarth-Bench**, the first comprehensive multimodal benchmark spanning all six Earth science spheres (atmosphere, lithosphere, oceansphere, cryosphere, biosphere, and human-activities sphere) as well as cross-sphere interactions, with one hundred expert-curated evaluation dimensions.
Leveraging observational data from satellite sensors and in-situ measurements, OmniEarth-Bench integrates 29,779 annotations across four tiers: perception, general reasoning, expert-knowledge deductive reasoning, and chain-of-thought (CoT) reasoning.
Constructing the benchmark involved 2–5 experts per sphere, who established authoritative evaluation dimensions and curated the relevant observational datasets, and 40 crowd-sourced annotators who assisted the experts with annotation; the resulting labels were then validated through hybrid expert-crowd workflows to reduce ambiguity.
Experiments on 9 state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmark: none reaches 35\% accuracy, and on some cross-sphere tasks GPT-4o scores 0.0\%.
OmniEarth-Bench sets a new standard for geosystem-aware AI, advancing both scientific discovery and practical applications in environmental monitoring and disaster prediction.
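As a concrete illustration of the headline metric, the sketch below computes exact-match accuracy over multiple-choice annotations, grouped by evaluation tier. The record fields (`tier`, `answer`) and the letter-choice answer format are assumptions about the annotation layout, not the benchmark's documented schema.

```python
# Minimal sketch of tier-wise exact-match accuracy for multiple-choice
# items. Assumes each record carries a "tier" label and a gold "answer"
# letter; both field names are hypothetical.
from collections import defaultdict

def accuracy_by_tier(records, predictions):
    """records: dicts with 'tier' and 'answer'; predictions: model output strings."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec, pred in zip(records, predictions):
        totals[rec["tier"]] += 1
        if pred.strip().upper() == rec["answer"].strip().upper():
            hits[rec["tier"]] += 1
    return {tier: hits[tier] / totals[tier] for tier in totals}

# Toy usage with made-up records:
recs = [{"tier": "perception", "answer": "B"}, {"tier": "CoT", "answer": "A"}]
preds = ["b", "C"]
print(accuracy_by_tier(recs, preds))  # {'perception': 1.0, 'CoT': 0.0}
```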
Croissant File: json
Dataset URL: https://huggingface.co/datasets/initiacms/OmniEarth-Bench
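The dataset is hosted on the Hugging Face Hub; a minimal sketch for fetching it with the standard `huggingface_hub` library follows. The repo id comes from the Dataset URL above, but the file layout inside the repo is not specified here.

```python
# Minimal sketch: download the OmniEarth-Bench repository from the Hub.
# snapshot_download works regardless of how the repo's files are organized.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="initiacms/OmniEarth-Bench",
    repo_type="dataset",  # this is a dataset repo, not a model repo
)
print(local_dir)  # local path to the downloaded annotation files and imagery
```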
Supplementary Material: pdf
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 920