EarthMind: Leveraging Cross-Sensor Data for Advanced Earth Observation Interpretation with a Unified Multimodal LLM
Keywords: Earth Observation, Cross-sensor fusion, Multimodal LLMs
Abstract: Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs, overlooking the complementarity across heterogeneous modalities. We propose EarthMind, *a unified vision-language framework* that handles both *single- and cross-sensor* inputs via a hierarchical cross-modal attention (HCA) design. Specifically, HCA hierarchically captures visual relationships across sensors and aligns them with language queries, enabling adaptive fusion of optical and Synthetic Aperture Radar (SAR) features. To support cross-sensor learning, we curate *FusionEO*, a 30K-pair dataset with diverse annotations, and establish *EarthMind-Bench*, a 2,841-pair benchmark with expert annotations for perception and reasoning tasks. Extensive experiments show that EarthMind achieves state-of-the-art results on EarthMind-Bench and surpasses existing MLLMs on multiple EO benchmarks.
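To make the fusion idea concrete, below is a minimal sketch of a hierarchical cross-modal attention block in the spirit described by the abstract: cross-sensor attention aligns SAR tokens to optical tokens, an adaptive gate fuses the two streams, and a language-conditioned attention aligns the fused features with the query. All module names, shapes, and hyperparameters here are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of an HCA-style optical/SAR fusion block (assumed design, not the paper's code).
import torch
import torch.nn as nn


class HCAFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Stage 1: cross-sensor attention (optical tokens attend to SAR tokens).
        self.cross_sensor_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stage 2: language-conditioned attention (text query attends to fused visual tokens).
        self.lang_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Adaptive per-token gate weighting optical vs. SAR contributions.
        self.gate = nn.Sequential(nn.Linear(dim * 2, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, optical: torch.Tensor, sar: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # optical: (B, N_opt, D) tokens; sar: (B, N_sar, D) tokens; query: (B, N_txt, D) tokens.
        sar_aligned, _ = self.cross_sensor_attn(optical, sar, sar)   # SAR info aligned to optical positions
        g = self.gate(torch.cat([optical, sar_aligned], dim=-1))     # adaptive fusion weights in (0, 1)
        fused = self.norm(g * optical + (1.0 - g) * sar_aligned)     # gated optical/SAR mixture
        out, _ = self.lang_attn(query, fused, fused)                 # align fused visuals with the language query
        return out                                                   # (B, N_txt, D) query-aligned features


if __name__ == "__main__":
    fuse = HCAFusion()
    opt, sar, txt = torch.randn(2, 196, 768), torch.randn(2, 196, 768), torch.randn(2, 32, 768)
    print(fuse(opt, sar, txt).shape)  # torch.Size([2, 32, 768])
```

The gated mixture is one plausible way to realize the "adaptive fusion" the abstract mentions; the actual EarthMind architecture may differ in depth, token routing, and how the hierarchy over spatial scales is built.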
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13086