Keywords: SAR, vision-language alignment, Earth observation, ground-photo supervision, zero-shot remote sensing
TL;DR: A single projection head that makes SAR data text-queryable and unlocks closed multi-sensor encoders like AlphaEarth.
Abstract: Synthetic Aperture Radar (SAR) backscatter does not resemble ground appearance, so optical vision-language models do not generalise to it. The recent SAR-text corpora that have begun to fill the gap (SARChat-2M, SARLang-1M, SAR-TEXT, FSAR-Cap) still require SAR-specific text construction via templates, detection-label expansion, or automated narrators. GLUE-Link bypasses that requirement entirely. A single 1x1 linear projection maps the output of any frozen satellite encoder into the SigLIP-2 ground-photograph embedding space, supervised only by geographic co-location with LUCAS field photos: no labels, no captions, no SAR-language pairs. With a pure SAR encoder (SSL4EO-S1), the linked representation reaches 45.4% zero-shot top-1 accuracy on 8-class LUCAS land cover and a Spearman ρ=0.890 caption agreement, matching the multispectral MS-CLIP in accuracy (45.7%, ρ=0.549) and approaching the native optical model from a concurrent companion work (50.7%, ρ=0.912). Any new satellite encoder, including closed multi-sensor models such as AlphaEarth, can be made text-queryable in one linear training step on existing ground-photo surveys.
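The linking idea in the abstract can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption for illustration rather than the authors' implementation: the abstract does not state the training objective, so a ridge least-squares fit stands in for it; embeddings are treated as flat vectors rather than spatial feature maps; and the names `fit_linear_link` and `zero_shot_predict`, the dimensions, and the random arrays standing in for frozen satellite, ground-photo, and text embeddings are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def fit_linear_link(sat_emb, photo_emb, ridge=1e-3):
    """Fit W so that sat_emb @ W approximates photo_emb.

    Ridge least squares is an assumed stand-in objective; the only
    supervision is the pairing of rows (co-located satellite/photo samples).
    """
    d = sat_emb.shape[1]
    gram = sat_emb.T @ sat_emb + ridge * np.eye(d)
    return np.linalg.solve(gram, sat_emb.T @ photo_emb)


def zero_shot_predict(sat_emb, W, text_emb):
    """Project satellite embeddings and pick the most similar text prompt."""
    linked = l2_normalize(sat_emb @ W)
    sims = linked @ l2_normalize(text_emb).T  # cosine similarities
    return sims.argmax(axis=1)


# Synthetic stand-ins for frozen encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
n, d_sat, d_photo, n_classes = 400, 32, 16, 8

text_emb = l2_normalize(rng.normal(size=(n_classes, d_photo)))
labels = rng.integers(0, n_classes, size=n)  # used for evaluation only
photo_emb = l2_normalize(text_emb[labels] + 0.1 * rng.normal(size=(n, d_photo)))

# Satellite features: an unknown linear mix of the photo embedding plus noise.
mix = rng.normal(size=(d_photo, d_sat))
sat_emb = photo_emb @ mix + 0.05 * rng.normal(size=(n, d_sat))

# Train the link on co-located pairs; no labels or captions are used here.
W = fit_linear_link(sat_emb[:300], photo_emb[:300])

# Evaluate zero-shot classification on held-out satellite embeddings.
pred = zero_shot_predict(sat_emb[300:], W, text_emb)
accuracy = float((pred == labels[300:]).mean())
print(f"held-out zero-shot accuracy on synthetic data: {accuracy:.2f}")
```

The class labels never enter the fit, which mirrors the claim that supervision comes only from co-location. The synthetic accuracy says nothing about real SAR performance.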
Submission Number: 8