\documentclass[pmlr, twocolumn]{jmlr}% new name PMLR (Proceedings of Machine Learning Research)

 % The following packages will be automatically loaded:
 % amsmath, amssymb, natbib, graphicx, url, algorithm2e

\usepackage{longtable}% for long tables
\usepackage{booktabs}
\usepackage[table]{xcolor}
\usepackage{hyperref}
\usepackage{url}
\usepackage{graphicx}
\usepackage{array}
\usepackage{wrapfig}
\usepackage{enumerate}
\usepackage{subcaption}
\usepackage{multirow}
\usepackage{floatrow}
\usepackage{adjustbox}
\usepackage{makecell}   
\usepackage{paralist}
% \usepackage{algpseudocode}
\usepackage{dblfloatfix}
\usepackage{tablefootnote}
\newcommand{\coloredurl}[2][blue]{%
  \href{#2}{\textcolor{#1}{\nolinkurl{#2}}}%
}
\newcolumntype{L}[1]{>{\raggedright\arraybackslash}p{#1}} % fixed-width left
\newcolumntype{C}[1]{>{\centering\arraybackslash}p{#1}}   % fixed-width center
\newcolumntype{X}{>{\raggedright\arraybackslash}X}         % auto width
\usepackage{pifont}     % \ding symbols for ✓ and ✗
\newcommand{\cmark}{\ding{51}}  % check‑mark
\newcommand{\xmark}{\ding{55}}
\usepackage[most]{tcolorbox}
\usepackage{placeins}
 % The siunitx package is used by this sample document
 % to align numbers in a column by their decimal point.
 % Remove the next line if you don't require it.
\usepackage[load-configurations=version-1]{siunitx} % newer version
 %\usepackage{siunitx}
\usepackage[capitalize,noabbrev]{cleveref}
 % The following command is just for this sample document:
\newcommand{\cs}[1]{\texttt{\char`\\#1}}

 % Define an unnumbered theorem just for this sample document:
% \theorembodyfont{\upshape}
% \theoremheaderfont{\scshape}
% \theorempostheader{:}
% \theoremsep{\newline}
% \newtheorem*{note}{Note}

 % change the arguments, as appropriate, in the following:
% \jmlrvolume{1}
% \jmlryear{2025}
% \jmlrworkshop{Proceedings of TerraBytes: Towards global datasets and models
% for Earth Observation Workshop at the 42nd International Conference on
% Machine Learning}
% \jmlrvolume{}      % clear volume
% \jmlryear{}        % clear year
\jmlrproceedings{}{%
  \parbox[t]{\textwidth}{\scriptsize
    Proceedings of TerraBytes: Towards Global Datasets and Models for Earth Observation Workshop\\
    at the 42nd International Conference on Machine Learning
  }%
}
\title[Multi-Modal Inputs Can Improve Data-Efficiency and O.O.D. in ML4EO]{Using Multiple Input Modalities Can Improve Data-Efficiency and O.O.D. Generalization for ML with Satellite Imagery}

% \titlebreak This Title Has
%A Line Break\titletag{\thanks{sample footnote}}}

 % Use \Name{Author Name} to specify the name.

 % Spaces are used to separate forenames from the surname so that
 % the surnames can be picked up for the page header and copyright footer.
 
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % *** Make sure there's no spurious space before \nametag ***

 % Two authors with the same address
  % \author{\Name{Arjun Rao\nametag{\thanks{with a note}}} \Email{abc@sample.com}\and
  %  \Name{Author Name2} \Email{xyz@sample.com}\\
  %  \addr Address}

 % Three or more authors with the same address:
 \author{\Name{Arjun Rao} \Email{raoarjun@colorado.edu}\\
  \Name{Esther Rolf} \Email{esther.rolf@colorado.edu}\\
  \addr Department of Computer Science, University of Colorado Boulder}


 % Authors with different addresses:
 % \author{\Name{Author Name1} \Email{abc@sample.com}\\
 % \addr Address 1
 % \AND
 % \Name{Author Name2} \Email{xyz@sample.com}\\
 % \addr Address 2
 %}

%\editor{Editor's name}
 % \editors{List of editors' names}

\begin{document}

\maketitle

\begin{abstract} A large variety of geospatial data layers is available around the world, ranging from remotely sensed raster data like satellite imagery, digital elevation models, predicted land cover maps, and human-annotated data, to data derived from environmental sensors such as air temperature or wind speed data. The large majority of machine learning models trained on satellite imagery \textbf{(SatML)}, however, are designed primarily for \emph{optical} input modalities such as multi-spectral satellite imagery. To better understand the value of using other input modalities alongside optical imagery in supervised learning settings, we generate augmented versions of SatML benchmark tasks by appending additional geographic data layers to datasets spanning classification, regression, and segmentation. Using these augmented datasets, we find that fusing additional geographic inputs with optical imagery can significantly improve SatML model performance. Benefits are largest in settings where labeled data are limited and in geographic out-of-sample settings, suggesting that multi-modal inputs may be especially valuable for data-efficiency and out-of-sample performance of SatML models. Surprisingly, we find that hard-coded fusion strategies outperform learned variants, with interesting implications for future work.
\end{abstract}
\begin{keywords}
Machine Learning for Earth Observation, Label-Efficiency. 
\end{keywords}

\section{Introduction}
\label{sec:intro}
SatML models that effectively leverage the volume and diversity of data from Earth Observation (EO) satellites have the potential to translate petabyte-scale raw data into data-driven insights. Users of SatML systems need models that can integrate these vast arrays of publicly available geographic data into a cohesive representation of the world, allowing for accurate predictions even with limited training data, or when faced with covariate shifts across time, space, spectrum, and scale \citep{rolfposition}.

While including additional input layers is clearly likely to increase performance for in-sample prediction with ample training data, the effects of adding additional input layers in settings with \emph{limited label data} and \emph{out-of-sample deployment} distributions are less clear. Additional geographic inputs could inform a SatML model with structural information that may allow the model to learn geospatial image representations with fewer labeled training samples (label-efficiency); they could also require more complex (data-hungry) models to represent the various modalities of data. Additional inputs could help SatML models generalize across regions; they could also cause models to overfit to local patterns that only manifest in-sample, which could then decrease performance. 

\textbf{In this work, we study the label-efficiency and out-of-sample generalization capability associated with adding non-optical, contextual inputs to commonly used SatML architectures.}  

As outlined in \citet{datacentric}, data-centric learning is a systematic method of algorithmic evaluation where the primary focus involves curating diverse, complete, unbiased, and relevant data for optimal model performance. We perform a \emph{data-centric} study on the benefits and nuances of leveraging these widely available geographic input layers, complementing previous lines of model-centric research that study how to utilize multi-modal inputs for a fixed training/pretraining strategy and/or model architecture. 

Our primary findings in this work are:
\begin{inparaenum}[(1)]
  \item We show improvements in label-efficiency when multi-modal, auxiliary geographic inputs are fused with optical imagery on three SatML task types: multi-label land-cover classification, land-cover segmentation, and tree-cover regression.
  \item We find that these auxiliary geographic inputs are especially helpful when SatML models are evaluated out-of-distribution (OOD), as shown by results on the spatially buffered test split of the BigEarthNetv2.0 dataset \citep{bigearthnetv2} and on the OOD test cities of the EnviroAtlas dataset, Austin, TX, and Durham, NC \citep{ipm}.
  \item Through our ablations, we find, surprisingly, that fine-tuning the modules that process these auxiliary geographic inputs jointly with the SatML model does not reliably improve performance on common benchmark tasks.
\end{inparaenum}

Our contributions also include a large-scale, multi-dataset release containing modified versions of the SustainBench farmland boundary delineation dataset \citep{sustainbench} and the USAVars tree-cover regression dataset \citep{mosaiks}, with additional geographic inputs georeferenced to the optical imagery. Additionally, we release the BigEarthNetv2.0 dataset \citep{bigearthnetv2} with patch-level embeddings pre-computed using the SatCLIP location encoder \citep{satclip}. A full list of contributed data products is shown in the ``Additional Data Layers'' column of \Cref{tab:dataset_modalities}.


\section{Prior Work}
\subsection{Multi-Modal SatML}
\label{sec:prior_work} Adding non-optical context to machine learning models trained on geospatial imagery has been explored extensively in prior work. \citet{loc1} extract GPS features from the Yahoo Flickr Creative Commons 100M dataset and fuse embeddings of location information with the final embeddings from a convolution-based image network. \citet{loc2} incorporate geolocation information into fine-grained image classification through the use of geolocation priors, introducing the computer vision community to geo-aware neural networks. \citet{mac2019presence} perform fine-grained image classification with a location, time, and photographer prior to differentiate between similar classes that are spatially disparate. \citet{benson2024multi} add a contextual input to predict future vegetation state given temporally rich satellite imagery and future weather information. \citet{wang2020urban2vec} propose an unsupervised multi-modal framework that incorporates both street view imagery and point-of-interest data to learn neighborhood embeddings in urban areas. \citet{opensentinelmap, osm2, osm3} introduce large-scale Sentinel-2 datasets georeferenced with OpenStreetMap (OSM) rasters \citep{osm} converted for use as land-use/land-cover (LULC) maps. However, these works restrict their use of publicly available geographic data layers to ground-truth masks for land-cover classification problems.

Recently, \citet{mmearth} introduced large, multi-modal pre-training datasets built with Sentinel-2 imagery that contain several geographic modalities, such as ESA WorldCover \citep{zanaga2022esa} and a digital elevation model. Although MMEarth \citep{mmearth} is pre-trained on these modalities, it is only used to predict these modalities given a Sentinel-2 RGB image as input; nonetheless, they find data-efficiency improvements when their self-supervised models are linear-probed on various downstream classification tasks. \citet{multimae} utilize the Aster-DEM and the ESA WorldCover rasters produced by \citet{mmearth} as additional input to a masked autoencoder (MAE). However, the bulk of their experiments are performed with various permutations of Sentinel-2-derived multispectral modalities.


\subsection{Token Fusion}
Studies on Vision Transformers (ViTs) have explored the use of additional tokens to improve performance and capture more nuanced information. In \citet{vit}, a \emph{class token} (\texttt{[CLS]}) was introduced and appended to the patch embeddings, enabling the model to learn a global representation useful for classification tasks. \citet{deit} introduce a \emph{distillation token} to facilitate knowledge transfer from a teacher model, boosting accuracy without substantially increasing computational cost. \citet{vpt, vpt2} demonstrate that injecting a small set of learnable prompts into the early layers of pre-trained ViTs can effectively adapt them to new downstream tasks. \citet{registers} highlights the importance of internal ``registers'' in ViT architectures, arguing that specialized design choices can better accommodate these additional tokens for more robust representations.

\input{sec/methods_figure}

\subsection{Geographic Data Fusion}
  
\Cref{fig:schematic} contains an overview of the geographic input-fusion techniques used in this work. For land cover segmentation with the EnviroAtlas dataset, we fuse the original inputs (NAIP aerial imagery) with roads, waterways, and waterbody data from the OSM repository \citep{osm} using the fusion method \texttt{STACK}. We also compute the hand-crafted prior for the training split in Pittsburgh and the test splits in Austin and Durham using the methodology proposed in \citet{ipm}. The generation of the prior is denoted by $f(\cdot)$ in \Cref{fig:schematic} and is described in detail in appendix \Cref{sec:priorgen}. The raw geographic data layers are used as input to the prior function, and the resulting prior is fused with the optical input to the SatML model. The generation of the prior followed by fusion with the optical input forms our fusion method \texttt{PROC-STACK}.
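
As a notational sketch of the two raster fusion mechanisms above, let $x \in \mathbb{R}^{C \times H \times W}$ denote the optical input and $a \in \mathbb{R}^{K \times H \times W}$ the co-registered geographic layers. The symbols $x$, $a$, $\tilde{x}$, and $\oplus$ are introduced here for illustration only, and we assume that fusion amounts to channel-wise concatenation ($\oplus$) before the first layer of the SatML model:
\begin{align*}
\texttt{STACK:}\quad & \tilde{x} = x \oplus a, \\
\texttt{PROC-STACK:}\quad & \tilde{x} = x \oplus f(a),
\end{align*}
so that $\tilde{x}$ has $C + K$ channels under \texttt{STACK} and $C + K'$ channels under \texttt{PROC-STACK}, where $K'$ is the number of channels output by $f(\cdot)$.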

For the farmland-parcel delineation task with the SustainBench dataset and the tree-cover regression task with the USAVars dataset, we use OSM raster layers that contain all the geographic data layers used for the EnviroAtlas dataset, with the addition of several land-use and land-cover classes that are broadly relevant to the task. These additional raster layers include high-level biome information such as forests, wetlands, or urban-type terrain. Output GeoDataFrames are pre-processed to RGB space. We apply a smoothing kernel ($\sigma = 1.0$) to remove sharp edges and features from the API response. A complete list of raster inputs queried for the USAVars dataset is given in appendix \Cref{fig:sample_osm}. Additionally, we pull a digital elevation model (DEM) from the Continental Europe Digital Terrain Model available through the OpenTopography API. The DEM raster, originally available at a $20$m ground sampling distance (GSD), is resampled to the Sentinel-2 RGB spatial resolution of $10$m/px. Unlike the OSM rasters, the DEM is passed as raw input with fusion mechanism \texttt{STACK}.
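
To make the two preprocessing steps above concrete, the following is a sketch under two assumptions that are not stated in the text: that the smoothing kernel is an isotropic Gaussian, and that $\sigma$ is measured in pixels. Under these assumptions, the kernel weight at pixel offset $(u, v)$ is
\begin{equation*}
G_{\sigma}(u, v) = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{u^{2} + v^{2}}{2\sigma^{2}}\right), \qquad \sigma = 1.0 .
\end{equation*}
For the DEM, moving from a $20$m to a $10$m grid corresponds to an upsampling factor of $20/10 = 2$ along each axis, so each original DEM cell covers $2 \times 2 = 4$ Sentinel-2 pixels after resampling.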

To be comparable to previous benchmark results, we use a fully convolutional network (FCN) for the EnviroAtlas dataset \citep{ipm}, a U-Net \citep{unet} for the SustainBench field-delineation dataset \citep{sustainbench}, and a ResNet-50 \citep{resnet} for the regression task proposed in the USAVars dataset \citep{mosaiks}.

For the BigEarthNetv2.0 image-level multi-label classification task, we use vision transformer (ViT) architectures (ViT-B/8 and ViT-S/8). To the Sentinel-2 input, we fuse general-purpose global SatCLIP location embeddings \citep{satclip}, which distill socioeconomic and environmental signals in satellite imagery into a pretrained location encoder $g(\textrm{lat,lon})$ with output dimension 256. Embeddings from SatCLIP's location encoder are passed as an auxiliary token to the ViT's encoder along with the image tokens. We add a linear layer to SatCLIP's location encoder that maps the 256-dimensional SatCLIP embeddings to the token embedding dimension expected by the ViT-S/ViT-B. The auxiliary SatCLIP token is assigned a positional encoding of $N+1$, where $N$ is the total number of encoder tokens excluding the classification token. For our main experiments, the parameters within the SatCLIP model $g(\textrm{lat,lon})$ are frozen; we experiment with unfreezing these weights in \Cref{fig:ft_satclip} and \Cref{sec:ablations}.
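
The token construction described above can be sketched as follows. The symbols $W$, $b$, $t_{\mathrm{loc}}$, $E$, $E_{\mathrm{pos}}$, and $D$ are introduced here for illustration; we assume the standard ViT input formulation of \citet{vit} and that the auxiliary token is appended after the $N$ image tokens:
\begin{align*}
t_{\mathrm{loc}} &= W\, g(\textrm{lat,lon}) + b, \quad W \in \mathbb{R}^{D \times 256}, \\
z_{0} &= \left[\, x_{\mathrm{cls}};\ x^{1}_{p} E;\ \dots;\ x^{N}_{p} E;\ t_{\mathrm{loc}} \,\right] + E_{\mathrm{pos}},
\end{align*}
where $E$ is the patch projection, $E_{\mathrm{pos}} \in \mathbb{R}^{(N+2) \times D}$ holds the learnable positional encodings for positions $0, \dots, N+1$, and $D$ is the token width of the ViT ($384$ for ViT-S and $768$ for ViT-B).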



\begin{table*}[htb]
\footnotesize
\caption{\textbf{Experimental framework and source tasks used in this work:} We test fusion mechanisms \texttt{STACK} and \texttt{PROC-STACK} on the EnviroAtlas \citep{ipm}, SustainBench \citep{sustainbench}, and USAVars \citep{mosaiks} benchmark datasets. We test fusion mechanism \texttt{TOKEN-FUSE} on the BigEarthNetv2.0 \citep{bigearthnetv2} classification dataset. The labels queried to form the OSM rasters are shown in appendix \Cref{fig:sample_osm}. \textdagger~denotes geographic data layers released with this work (aligned with the benchmark datasets). }
\label{tab:dataset_modalities}
\begin{tabular*}{\textwidth}{@{\extracolsep{\fill}} 
    p{2.5cm}  % Dataset
    p{3.2cm}  % Task Description, reduced
    p{1.9cm}  % Multispectral Input
    c         % Model
    p{2.5cm}  % Additional Data Layers
    c         % OOD?
    @{}}
\toprule
\textbf{Dataset} & \textbf{Task Description} & \textbf{Multispectral Input} 
  & \textbf{Model} & \textbf{Additional Data Layers} & \textbf{OOD?} \\[0.1em]
\midrule
SustainBench \citep{sustainbench} 
  & Farmland boundary delineation 
  & Sentinel-2 RGB 
  & U-Net 
  & OSM rasters\textdagger, EU-DEM\textdagger 
  & \xmark \\[0.3em]

EnviroAtlas \citep{ipm} 
  & Land-cover segmentation 
  & NAIP RGB + NIR 
  & FCN 
  & Prior \citep{ipm}, OSM rasters 
  & \checkmark \\[0.1em]

BigEarthNetv2.0 \citep{bigearthnetv2} 
  & Land-cover classification 
  & Sentinel-2 (10 bands) 
  & ViT 
  & SatCLIP \citep{satclip} embeddings\textdagger 
  & \checkmark \\[0.5em]

USAVars \citep{mosaiks} 
  & Tree-cover regression 
  & NAIP RGB + NIR 
  & ResNet-50 
  & OSM rasters\textdagger 
  & \xmark \\[0.3em]
\bottomrule
\end{tabular*}
\end{table*}

\subsection{Models}
\textbf{Convolutional Architectures: } 
 In this work, we use simple, widely used convolutional neural networks for data fused with the fusion mechanisms \texttt{STACK} and \texttt{PROC-STACK}. We choose simple architectures over specialized SatML model architectures because we are primarily interested in comparing different data settings and fusion strategies, and we choose models that are consistent with the architectures used in prior work. For experiments on the EnviroAtlas dataset, we use a 5-layer FCN. For segmentation on the SustainBench field-boundary delineation task, we use a U-Net \citep{unet} with the same architectural setup and hyperparameters as \citet{ermonsus} to allow for consistency when comparing results. For regression on the USAVars tree-cover dataset, we use a vanilla ResNet-50 \citep{resnet} with randomly initialized weights.

\textbf{Vision Transformers (ViTs): } Vision Transformers (ViTs) \citep{vit} utilize the transformer architecture proposed by \citet{vaswani2017attention}. Input images are decomposed into a sequence of small, non-overlapping patches, which are mapped to embeddings (tokens) with a linear projection layer.
% \verb|[CLS]| is a learnable additional token introduced to capture label information. 
Unlike \citet{satmae, scalemae}, who use various versions of sinusoidal positional encodings that are sensitive to ground sampling distance (GSD) and temporal information, we augment image patches with learnable positional encodings.

\textbf{Learned location encoders: } Location encoders in SatML help models interpolate to new geographic regions by incorporating terrain and environmental signals given a (lat, lon) pair. SatCLIP \citep{satclip} builds on GeoCLIP \citep{geoclip}, CSP \citep{CSP}, and GPS2Vec \citep{gps2vec} by integrating a CLIP-inspired \citep{clip} contrastive learning framework specifically designed for satellite imagery from the Sentinel-2 EO satellite. SatCLIP's location encoder, which can be used out-of-the-box, accurately captures terrain, environmental, and socioeconomic signals \citep{satclip}. Unlike the previously used convolutional architectures that accept a rasterized input of geographic data projected to the correct Coordinate Reference System (CRS), models trained with the SatCLIP location encoder accept embeddings as an auxiliary token.

\subsection{Datasets}
We conduct experiments using 4 benchmark datasets in ML for remote sensing. These datasets cover different prediction tasks, multi-spectral input sources, and additional data layers used. \Cref{tab:dataset_modalities} presents an overview of the datasets and additional layers used. 
\emph{All additional geographic data layers, georeferenced with benchmark datasets, are available as a hosted dataset at }\coloredurl[magenta]{https://huggingface.co/datasets/arjunrao2000/geolayers}. We release our code that allows for training models on our datasets at \coloredurl[magenta]{https://github.com/arjunarao619/geolayers-terrabytes}.

\textbf{BigEarthNet (Classification): }The BigEarthNetv2.0 dataset \citep{bigearthnet, bigearthnetv2} is a multi-label classification task that consists of approximately 550,000 Sentinel-2 image patches, each paired with ground-truth labels spanning 19 land-cover classes. Our models take 10 Sentinel-2 bands as input to ensure consistency with the benchmark results reported in \citet{bigearthnetv2}. Unlike the original BigEarthNet dataset \citep{bigearthnet}, BigEarthNetv2.0 \citep{bigearthnetv2} constructs training, validation, and test splits using a grid-based split assignment algorithm. The validation and test sampling areas lie outside the geographic extent of the training sampling area, preventing data leakage. Thus, our validation and test results on the BigEarthNetv2.0 dataset can be considered out-of-sample.


\textbf{EnviroAtlas (Land Cover Segmentation): } The EnviroAtlas dataset (compiled by \citet{ipm} and composed of data from \citet{enviroatlas}) consists of high-resolution ($1$m) land cover maps derived from NAIP aerial imagery. In this dataset, coarse land-cover maps from the National Land Cover Database (NLCD) are aligned with buildings, road networks, water bodies, and waterways from public sources such as the OSM project \citep{osm}. 
The ``prior'' data layer constructed in \citet{ipm} is a (hand-coded) fusion of NLCD data with OSM data, in the form of \texttt{PROC-STACK}. EnviroAtlas's train split only covers the Pittsburgh region. We use the provided out-of-sample validation and test datasets in Austin and Durham and in-distribution validation and test datasets in Pittsburgh. 

\textbf{SustainBench (Field Boundary Delineation): } The SustainBench benchmark proposed in \citet{sustainbench} contains a collection of 15 benchmark tasks in machine learning for remote sensing spanning 7 of the United Nations Sustainable Development Goals (SDGs). We use the field-delineation task, which consists of Sentinel-2 imagery over France from 2017. Each input image has a ground sampling distance of $10$m and a size of $224 \times 224$ pixels, corresponding to a surface area of approximately $5$ km$^2$ per image.
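
As a quick check of the stated footprint, assuming square pixels at exactly $10$m spacing:
\begin{equation*}
224 \times 10\,\mathrm{m} = 2.24\,\mathrm{km}, \qquad (2.24\,\mathrm{km})^{2} \approx 5.02\,\mathrm{km}^{2},
\end{equation*}
which is consistent with the approximately $5$ km$^2$ surface area covered per image.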

\textbf{USAVars Tree-cover (Regression): }
The USAVars dataset proposed in \citet{mosaiks} comprises approximately 100,000 NAIP aerial images, each cropped to a spatial extent of approximately $1$ km$^2$ and paired with real-valued labels including tree cover and population density. We pull rasters of several land-cover and infrastructure-related classes from OSM \citep{osm} as a geographic input, aligned to the RGB layers. Our final set of labels covers broad biome-related land-cover classes such as waterbodies, forests, and buildings, with fine-grained labels covering sub-categories of biomes. A complete list of labels pulled from OSM is shown in appendix \Cref{fig:sample_osm}.

\input{sec/sustainbench_and_usavars_figure}

\section{Results} \label{sec:results}


Across all four SatML benchmark datasets, covering semantic segmentation and multi-label classification for land cover, field boundary delineation, and regression, we found that adding contextual, geographic inputs improves model performance, with the largest gains in settings with limited label data (\Cref{sec:results-data-efficiency}) and on out-of-distribution test sets (\Cref{sec:results-OOD}). Ablation experiments (\Cref{sec:ablations}) provide evidence that fine-tuning encoders aided by geographic input layers does not necessarily help in these critical settings.


\input{sec/bigearthnet_results}
\subsection{Geographic inputs can aid data-efficiency}
\label{sec:results-data-efficiency}

The benefit of additional geographic data inputs on data-efficiency of SatML models can be seen in all four experimental settings and all three fusion mechanisms.

From \Cref{fig:sustainbench}, we see performance improvements with low amounts of training data when using the \texttt{STACK} approach to fuse additional raster layers. A U-Net trained with OSM and DEM raster layers using fusion mechanism \texttt{STACK} exhibits an $8.1\%$ in-sample improvement in test Dice score when trained on $1$--$5\%$ of the training data on the SustainBench field-boundary delineation dataset, compared to a $4.1\%$ improvement when using the full training dataset. From appendix \Cref{tab:susbench_model_ablations}, we find that these performance improvements hold for most commonly used SatML segmentation model architectures introduced over the past five years. From \Cref{fig:usavars_treecover}, stacking OSM raster layers as input to a ResNet-50 for the USAVars tree-cover regression task improves $R^2$ by $0.162$ when trained on between 60 and 250 training images. This improvement shrinks to $0.026$ in $R^2$ when the full 68,000-image training dataset is used.
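
For reference, the two metrics reported above are sketched below under their standard definitions, which we assume here; the symbols $P$, $G$, $y_i$, $\hat{y}_i$, and $\bar{y}$ are introduced for illustration. For a predicted mask $P$ and a ground-truth mask $G$, and for labels $y_i$ with predictions $\hat{y}_i$ and label mean $\bar{y}$,
\begin{align*}
\mathrm{Dice}(P, G) &= \frac{2\,|P \cap G|}{|P| + |G|}, \\
R^{2} &= 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^{2}}{\sum_{i} (y_i - \bar{y})^{2}} .
\end{align*}
Under these definitions, the $R^2$ improvements above are absolute differences on a scale whose maximum is $1$.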

From \Cref{tab:enviroatlas}, we find that a prior generated and fused with \texttt{PROC-STACK} improves in-distribution test accuracy of land-cover segmentation on the EnviroAtlas dataset \citep{enviroatlas} by $9.3\%$ when trained on between $1\%$ and $5\%$ of the training dataset, compared to a $0.6\%$ improvement when trained with the full training dataset. When the raw data layers used to generate the prior in \citet{ipm} are instead fused with the fusion mechanism \texttt{STACK} before training, data-efficiency improvements drop to approximately $2\%$ over ten random seeds for this range ($1$--$5\%$) of training data, which is still an improvement.

% In \Cref{tab:vit_table}, a vision transformer trained with an auxiliary pretrained SatCLIP token (ResNet18, $L=10$) fused with \texttt{TOKEN-FUSE} improves performance on the BigEarthNetv2.0 dataset. When trained on between $1$ to $5\%$ of the training dataset, the token-aided ViT improves multi-label classification average precision  by $3.3\%$ and multi-label F1 score by $4. \%$, (numbers are averaged across training set sizes of $1$, $2$, and $5\%$ of the total training set). Performance boosts are more modest when more training data is available. Using $100\%$ of the training data, the improvement in average precision and F1 score drops to $0.8\%$ and $1.1\%$ respectively, highlighting the label-efficiency associated with passing pre-trained geospatial embeddings as auxiliary multi-modal token to the ViT. 

\input{sec/enviroatlas_figure}

On the SustainBench field-boundary delineation and the USAVars tree-cover regression datasets, we note that the largest gains in label-efficiency are observed with training dataset sizes of 100--700 images, which we observe to be the low-data regime where models trained with geographic input layers consistently outperform models trained on optical modalities alone. For example, on the USAVars tree-cover regression task, we observe a diminished gap in the test $R^2$ metric as we scale from 700 training samples ($\Delta_{R^{2}} = 0.36$) to 1400 training samples ($\Delta_{R^{2}} = 0.08$).

We also note that not \emph{all} geographic inputs/combinations of these inputs improve label-efficiency and OOD performance when fused with the SatML model using the fusion mechanisms introduced in \Cref{fig:schematic}. In \Cref{fig:enviroatlas}, we note that a road-map raster worsens performance compared to standard, multispectral-only training. Similarly, from \Cref{fig:sustainbench}, concatenating a single DEM raster to optical imagery for a field-boundary delineation task on the SustainBench dataset hurts performance in these settings. 


\input{sec/master_ablation_results}
\subsection{Geographic inputs can aid out-of-distribution performance}
\label{sec:results-OOD}
We also found that fusing additional geographic input layers to remotely sensed imagery can significantly aid geographic domain generalization. While the value of additional input layers is clear in low-label settings (here $<$800 training points) for all test cities in the EnviroAtlas dataset, \Cref{fig:enviroatlas} also shows an improvement in overall test accuracy across all amounts of training data for the out-of-distribution test cities in different states (Austin, TX and Durham, NC). We observe a $4.12\%$ improvement in the overall accuracy with the prior geographic data layer using \texttt{PROC-STACK} and a $2.03\%$ improvement when the raw raster data layers used to generate the prior are fused with \texttt{STACK}. Unlike the ID test set (Pittsburgh), the gains in performance in the OOD settings do not appear to diminish with more training samples, as the OOD performance curves remain significantly separated across settings, even using 100\% of the training data.

From \Cref{tab:vit_table}, performance improvements on the BigEarthNetv2.0 dataset with the auxiliary SatCLIP token fused with \texttt{TOKEN-FUSE} also hold over all training data subsets. This reflects OOD performance, as the BigEarthNetv2.0 validation and test splits use a spatial buffering approach \citep{bigearthnetv2}.
For a ViT-B, we observe a $3.1\%$ improvement in the multi-label F1 metric and a $2.5\%$ improvement in the multi-label average precision metric. Interestingly, for a ViT-S, this improvement in out-of-sample accuracy across all data subsets drops to a $2\%$ improvement in average precision and a $1\%$ improvement in the multi-label F1 metric. We hypothesize that this difference may be attributable to the reduced \emph{model expressivity} of ViT-S, which prevents it from fully exploiting the SatCLIP auxiliary token (embedding size of $384$ vs $768$).




\subsection{Finetuning geographic-input aided SatML models can hurt label-efficiency and OOD performance} \label{sec:ablations}
To determine whether geographic inputs that are learned during training aid the label efficiency and out-of-sample generalization of SatML models on commonly used benchmark datasets, we conduct ablation studies for the fusion mechanisms \texttt{TOKEN-FUSE} and \texttt{PROC-STACK}. In \Cref{sec:results-data-efficiency,sec:results-OOD}, we freeze the intermediate modules $f(\cdot)$ in \texttt{PROC-STACK} and $g(\cdot)$ in \texttt{TOKEN-FUSE} ($f(\cdot)$ and $g(\cdot)$ from \Cref{fig:schematic}). In \Cref{sec:learned-compression,sec:finetuned-satclip}, we finetune these modules jointly with the SatML model.


\subsubsection{Learned compression with \texttt{PROC-STACK}}
\label{sec:learned-compression}
To understand when a compressed embedding of geographic rasters can yield results similar to using all rasters as input, we design a trainable \texttt{PROC-STACK} fusion mechanism and use it to train a U-Net on the SustainBench field boundary delineation task. In this approach, we pass both the DEM (1 channel) and OSM (19 channels) geographic data layers to a trainable FCN architecture. Outputs from the FCN are stacked with the original optical input and passed to the U-Net, and both models are trained simultaneously.\footnote{To accommodate the increased number of trainable parameters, we increase the number of training epochs and allow the models to converge.}
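
As a rough formalization (the notation below is ours and is introduced only for illustration), let $x$ denote the optical input, $a$ the DEM and OSM rasters, assumed here to be concatenated into a single 20-channel input, $f_{\phi}$ the FCN with $k$ output channels, and $h_{\theta}$ the U-Net. The trainable variant can then be sketched as
\begin{align*}
\hat{y} &= h_{\theta}\big([\,x \,;\, f_{\phi}(a)\,]\big), \\
(\theta^{*}, \phi^{*}) &= \operatorname*{arg\,min}_{\theta,\,\phi} \; \mathcal{L}(\hat{y}, y),
\end{align*}
where $[\,\cdot\,;\,\cdot\,]$ denotes channel-wise concatenation, $y$ is the label mask, and $\mathcal{L}$ is the segmentation loss. Under the same notation, the frozen setting corresponds to holding $\phi$ fixed and optimizing only $\theta$. This is a sketch under the stated assumptions, not an exact specification of the implementation.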

Label efficiency on the SustainBench field boundary delineation dataset is shown in \Cref{tab:sustainbench_ablation}. The fusion mechanism \texttt{PROC-STACK} on learned, compressed inputs is not competitive with a simple \texttt{STACK} of the pre-processed, original rasters. Interestingly, we observe significantly improved label efficiency of the trained \texttt{PROC-STACK} ablation model between subsets $1\%$ and $5\%$. These label efficiency improvements, however, do not hold across all subsets. 


\input{sec/trainable_satclip_results}

\subsubsection{Fine-tuning location encoders in \texttt{TOKEN-FUSE}}
\label{sec:finetuned-satclip}
For classification on the BigEarthNetv2.0 dataset, instead of using a frozen SatCLIP encoder with a learnable linear projection layer (as in \Cref{sec:results-OOD}), we now make the SatCLIP location encoder trainable, initializing it from the original pre-trained SatCLIP weights.
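
In symbols (our notation, for illustration only), let $g_{\psi}$ be the SatCLIP location encoder applied to the coordinates $c$ of an image, let $W$ and $b$ be the learnable linear projection to the ViT embedding dimension, and let $z_1, \dots, z_N$ be the patch tokens, with any class token omitted for brevity. Assuming the auxiliary token is concatenated with the patch sequence,
\begin{align*}
t &= W\, g_{\psi}(c) + b, \\
Z_0 &= [\,t \,;\, z_1 \,;\, \dots \,;\, z_N\,].
\end{align*}
The frozen setting trains $(W, b)$ and the ViT while keeping $\psi$ fixed; the ablation here additionally updates $\psi$, starting from the pre-trained weights. The position of $t$ in the sequence and the presence of a bias term $b$ are assumptions of this sketch.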
%This novel training setting is equivalent to training the SatCLIP location encoder on an image-classification pseudo task. 

We find that label efficiency and out-of-sample performance degrade when the SatCLIP weights are learnable during training (\Cref{tab:satclip_ablation}). \Cref{fig:pairwise_sims} shows that fine-tuning the SatCLIP model in this fashion leads to embeddings that are highly localized within various countries covered by the BigEarthNetv2.0 dataset. This suggests that the augmented ViT may be overfitting to the auxiliary SatCLIP token, leading to lower test set performance when the SatCLIP model is trainable. Furthermore, overfitting is particularly likely considering that the trainable weights of the SatCLIP location encoder span $360$k parameters, significantly higher than other image tokens input to the ViT.\footnote{Adding layer normalization to the SatCLIP token does not significantly alter performance, label efficiency, or OOD generalization.}



To understand the performance discrepancy between the fine-tuned and frozen location encoders in the \texttt{TOKEN-FUSE} strategy, we compare the performance of our auxiliary SatCLIP token against a generic (non-geospatial) learnable register token as a baseline. First introduced in \cite{registers}, register tokens are randomly initialized, fully trainable prefix tokens. Register tokens capture high-norm ``outlier'' artifacts that hold significantly lower local-patch information. ViTs aided with registers show improvements only when trained with sufficiently large numbers of trainable parameters (ViT-B, ViT-L, ViT-H) over long training durations. We choose a ViT-B ($86$M trainable parameters) with the same hyperparameters as the experiments that produced the results in \Cref{tab:vit_table}.

From \Cref{tab:satclip_ablation}, we find that registers do not improve the label efficiency or out-of-sample performance of a ViT-B trained on the BigEarthNetv2.0 dataset compared to a frozen SatCLIP location encoder. Increasing the number of register tokens to as many as 3 does not significantly alter this result. Interestingly, both a register token and a fine-tuned SatCLIP token outperform a vanilla ViT-B when trained on between $1\%$ and $20\%$ of the training data, but perform worse than a vanilla ViT in the large-data ($50\%$, $100\%$) regime.

\section{Experimental Takeaways}
\textbf{Takeaway 1: Auxiliary geographic inputs improve performance in low-data settings.}
In \Cref{sec:results-data-efficiency}, we find notable performance improvements in low-data settings with an auxiliary OSM and DEM geographic input layer (0.08 IoU on SustainBench, 9.3\% OA on EnviroAtlas, 0.162 R$^2$ improvement on USAVars). On the SustainBench field boundary delineation task, a U-Net trained with an OSM and EU-DEM raster matches the test IoU of an RGB-only model with only 224 training samples (compared to 1573 training samples for the RGB-only model).
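
As a back-of-the-envelope reading of the SustainBench sample counts quoted above,
\begin{equation*}
\frac{224}{1573} \approx 0.142, \qquad \frac{1573}{224} \approx 7.0,
\end{equation*}
that is, the model with auxiliary inputs reaches the same test IoU with roughly $14\%$ of the labels, which is approximately a sevenfold reduction in annotation requirements.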

\textbf{Takeaway 2: Auxiliary geographic inputs improve performance OOD.} From \Cref{sec:results-OOD}, we find that these geographic layers are especially helpful when evaluated on OOD splits of the benchmark datasets: a 4.12\% improvement in overall accuracy on EnviroAtlas's OOD cities and a 3.1\% improvement in multi-label F1 on BigEarthNetv2.0's spatially buffered test splits.

\textbf{Takeaway 3: Finetuning SatML models aided by auxiliary geographic inputs \emph{can} hurt performance.} Surprisingly, when we allow the intermediate module in \texttt{PROC-STACK} (denoted by $f(\cdot)$ in \Cref{fig:schematic}) to be trainable and act as a geographic input compression module, test IoU scores drop, on average, by $4.1\%$ on the SustainBench test set (\Cref{tab:sustainbench_ablation}). Larger performance drops occur as the expressivity of the intermediate FCN is increased from 1 to 3 output channels. From \Cref{sec:finetuned-satclip}, we find that allowing a SatCLIP encoder to be jointly trained with the SatML classification model causes the model to overfit (\Cref{fig:pairwise_sims}), hurting label efficiency and OOD performance in the BigEarthNet task.

\textbf{Limitations and future work: } In \Cref{fig:enviroatlas,fig:sustainbench,fig:vit_ben_plots,fig:usavars_treecover,tab:enviroatlas,tab:vit_table}, we use geographic data layers that make sense for the downstream task. As we are primarily interested in the potential benefits of using additional data layers, we restrict the scope of the study to these geographic input layers and do not train on a larger corpus of raster and scalar inputs. Here, we use the fusion mechanisms \texttt{STACK} and \texttt{PROC-STACK} for convolutional models and \texttt{TOKEN-FUSE} for ViTs since they involve minimal modifications to the source architectures; future work will examine more sophisticated fusion mechanisms.

\section*{Acknowledgements}
A majority of training runs conducted in this work were run on an NVIDIA Grace-Hopper (GH200) GPU node provided by the University of Colorado Boulder's high performance computing system Alpine. We thank Brandon Reyes and the RC computing team at CU Boulder for allowing access to this resource. Alpine is jointly funded by the University of Colorado Boulder, the University of Colorado Anschutz, Colorado State University, and the National Science Foundation (award 2201538). 

OpenStreetMap is open data, licensed under the \href{https://opendatacommons.org/licenses/odbl/}{Open Data Commons Open Database License} by the \href{https://osmfoundation.org/}{OpenStreetMap Foundation} (OSMF).

DEM data in this work is derived from services provided by the OpenTopography Facility with support from the National Science Foundation under NSF Award Numbers 2410799, 2410800 \& 2410801 \citep{opentopo}.

We thank Dr. Caleb Robinson for invaluable feedback during the writing stage of this work. We also thank the anonymous reviewers for their comments and suggestions. 

\section*{Impact Statement}
By lowering annotation costs and delivering consistent accuracy when models cross regional, temporal or sensor boundaries, our approach can democratize high-impact Earth-observation applications such as crop monitoring, disaster assessment and biodiversity mapping for organizations with limited resources. Because the fusion layers are lightweight and the best results come from \emph{frozen} tokens using \emph{pretrained} encoders, our work avoids the large training footprints typical of foundation-model fine-tuning, mitigating energy use relative to existing alternatives. However, the work also surfaces risks: uneven coverage or quality in auxiliary datasets (e.g., OSM) could entrench geographic biases, and fine-tuning the location encoder can cause severe overfitting to local patterns. Practitioners should therefore audit input-layer availability and monitor model generalization before deployment in safety- or equity-critical settings.


\bibliography{pmlr-sample}

\appendix

\input{sec/X_suppl}

\end{document}
