Abstract: Significant efforts have been directed towards adapting self-supervised multimodal learning for Earth observation applications. However, most current methods produce coarse patch-sized embeddings, limiting their effectiveness and their integration with other modalities like LiDAR. To close this gap, we present DUNIA, an approach to learn pixel-sized embeddings through cross-modal alignment between images and full-waveform LiDAR data. As the model is trained in a contrastive manner, the embeddings can be directly leveraged for a variety of environmental monitoring tasks in a zero-shot setting. In our experiments, we demonstrate the effectiveness of the embeddings for seven such tasks: canopy height mapping, fractional canopy cover, land cover mapping, tree species identification, plant area index, crop type classification, and per-pixel waveform-based vertical structure mapping. The results show that the embeddings, along with zero-shot classifiers, often outperform specialized supervised models, even in low-data regimes. In the fine-tuning setting, we show strong performance, near or better than the state of the art, on five out of six tasks.
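For concreteness, below is a minimal PyTorch sketch of the kind of pixel-level cross-modal contrastive objective the abstract describes: per-pixel image embeddings are aligned with embeddings of co-located LiDAR waveforms. This is a generic symmetric InfoNCE formulation, not the paper's exact loss; the function name, tensor shapes, and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_infonce(img_emb: torch.Tensor,
                  lidar_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Contrastive alignment between per-pixel image embeddings and
    embeddings of LiDAR waveforms at the same locations (hypothetical sketch).

    img_emb:   (N, D) one embedding per pixel
    lidar_emb: (N, D) one embedding per co-located waveform
    """
    img_emb = F.normalize(img_emb, dim=-1)
    lidar_emb = F.normalize(lidar_emb, dim=-1)
    # (N, N) cosine-similarity matrix; diagonal entries are positive pairs
    logits = img_emb @ lidar_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric loss over both retrieval directions (image->LiDAR, LiDAR->image)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```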
Lay Summary: Satellite images are invaluable for many essential tasks that support environmental sustainability, such as managing forests, protecting biodiversity, addressing climate change, managing natural resources, and monitoring agriculture.
Recently, supervised deep learning models have been proposed for these tasks and have produced impressive results. However, most of these models depend on large amounts of labeled data and are designed for specific applications, limiting their adaptability to new tasks.
Self-supervised learning approaches reduce the need for labeled data and offer greater flexibility. Nonetheless, many current methods understand satellite images only at a coarse scale, akin to condensing an entire photograph into a single sentence. This limits their usefulness for fine-grained, per-pixel analysis.
We introduce DUNIA, a method that learns detailed, per-pixel representations by combining optical and radar satellite imagery with spaceborne LiDAR data. Instead of relying on labeled examples, DUNIA uses contrastive learning to align different views of the same landscape, capturing both horizontal and vertical patterns.
These learned representations can be used directly for a wide range of environmental monitoring tasks. Our results show that DUNIA often matches or exceeds the performance of specialized models, even when labeled data are limited.
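As a sketch of how such embeddings can be used directly, without task-specific training, the snippet below labels unlabeled pixels by k-nearest-neighbour retrieval against a small labeled reference set in the shared embedding space. The retrieval scheme and all names here are assumptions for illustration, not the paper's exact zero-shot procedure.

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(query_emb: torch.Tensor,
                      support_emb: torch.Tensor,
                      support_labels: torch.Tensor,
                      k: int = 5) -> torch.Tensor:
    """Assign labels to query pixels via k-NN retrieval (hypothetical sketch).

    query_emb:      (Q, D) embeddings of unlabeled pixels
    support_emb:    (S, D) embeddings of a small labeled reference set
    support_labels: (S,)   integer labels of the reference pixels
    """
    q = F.normalize(query_emb, dim=-1)
    s = F.normalize(support_emb, dim=-1)
    sims = q @ s.t()                        # (Q, S) cosine similarities
    _, idx = sims.topk(k, dim=-1)           # k most similar reference pixels
    neighbour_labels = support_labels[idx]  # (Q, k) labels of those neighbours
    # Majority vote among the k neighbours gives the zero-shot prediction
    return neighbour_labels.mode(dim=-1).values
```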
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/AI4Forest/DUNIA
Primary Area: Applications->Chemistry, Physics, and Earth Sciences
Keywords: earth observation, multi-modality, self-supervised learning, cross-modal retrieval
Submission Number: 12982