\clearpage
\onecolumn
\setcounter{page}{1}


\section{Experimental Setup}
\subsection{FCN on EnviroAtlas Land Cover Segmentation with \texttt{STACK} and \texttt{PROC-STACK}}

We train a 5-layer Fully Convolutional Network (FCN) with 64 filters and an output smoothing of $10^{-2}$ on EnviroAtlas training images from Pittsburgh, PA. A batch size of $128$ and a learning rate of $10^{-3}$ are fixed across all training data subsets and random seeds reported in \Cref{fig:enviroatlas}. We fix the lower bound of the learning rate at $10^{-7}$. Table~\ref{tab:subset_epochs} reports the number of epochs for which each data-efficient FCN is trained. Note that FCNs trained on $1\%$ of EnviroAtlas' training data for 700 epochs trigger our early-stopping logic between epochs 200 and 300. For training, we use TorchGeo's \verb|RandomGeoSampler| with an input image size of $128$. Our test dataset uses TorchGeo's \verb|GridGeoSampler| with an input image size of $256$ and a stride of $512$ to avoid overlapping image patches. Our multi-modal inputs include road, water, waterway, and waterbody footprints from OpenStreetMap \citep{osm}.

\paragraph{Hand-crafted prior generation process.}
\label{sec:priorgen}
In our \texttt{PROC-STACK} experiments, the hand-crafted prior \(f(x_i)\equiv p_i(\ell)\) is constructed exactly as in \cite{ipm} (``Coarse data in weakly supervised segmentation'', \S3), using the NLCD 30\,m land-cover map to induce per-pixel beliefs over our four high-resolution classes. Concretely, we first compute the empirical co-occurrence matrix
\[
P(\ell \mid c)
\;=\;
\frac{\bigl|\{\,\text{high-res label}=\ell,\;\text{NLCD class}=c\}\bigr|}
{\sum_{\ell'}\bigl|\{\,\text{high-res label}=\ell',\;\text{NLCD class}=c\}\bigr|}
\]
from a held-out set of aligned NAIP+NLCD+Land Cover tiles. Then, for each pixel \(i\) with NLCD class \(c_i\), we set
\[
p_i(\ell)\;=\;P(\ell\mid c_i)
\]
and apply a small Gaussian blur (\(\sigma=1\) pixel) to smooth block artifacts. In \texttt{PROC-STACK} mode, we further enrich this prior with binary auxiliary masks (roads, buildings, waterways): for each feature \(j\), we define
\[
M_j(i)=
\begin{cases}
1, & \text{if feature }j\text{ lies within a 10\,m radius of pixel }i,\\
0, & \text{otherwise,}
\end{cases}
\]
and boost the class associated with feature \(j\) by adding a fixed weight \(w_j\) to \(p_i(\ell=j)\) wherever \(M_j(i)=1\). Finally, we re-normalize \(p_i(\ell)\) so that \(\sum_{\ell}p_i(\ell)=1\). This yields a spatially varying, hand-crafted prior that both encodes coarse NLCD statistics and injects domain knowledge via auxiliary GIS layers, as required by the \texttt{PROC-STACK} formulation.
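
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our released implementation: the class names, the NLCD codes, the pixel pairs, the boost weight, and the mapping from a road mask to the impervious class are all hypothetical, and the Gaussian blur step is omitted for brevity.

```python
from collections import Counter, defaultdict


def cooccurrence(pairs, classes):
    """Estimate P(l | c) from (high_res_label, nlcd_class) pixel pairs."""
    counts = defaultdict(Counter)
    for label, nlcd in pairs:
        counts[nlcd][label] += 1
    table = {}
    for nlcd, ctr in counts.items():
        total = sum(ctr.values())
        table[nlcd] = {l: ctr[l] / total for l in classes}
    return table


def boosted_prior(table, nlcd_class, active_boosts):
    """Look up p_i(l) = P(l | c_i), add w_j for each active mask, renormalize."""
    prior = dict(table[nlcd_class])
    for label, weight in active_boosts.items():
        prior[label] += weight
    z = sum(prior.values())
    return {l: p / z for l, p in prior.items()}


# Hypothetical class names and held-out pixel pairs, for illustration only.
CLASSES = ["water", "tree", "low_veg", "impervious"]
pairs = [("tree", 41)] * 3 + [("low_veg", 41)] + [("impervious", 23)] * 4
table = cooccurrence(pairs, CLASSES)

# A pixel with NLCD class 41 whose road mask is active: boost "impervious"
# by a hypothetical weight w_j = 0.5, then renormalize to sum to one.
p = boosted_prior(table, 41, {"impervious": 0.5})
print(p)
```

In this toy case the unboosted prior for NLCD class 41 puts mass $0.75$ on trees and $0.25$ on low vegetation; after the boost and re-normalization the impervious class receives one third of the mass.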


\begin{table}[h!]
  \centering
  \setlength{\tabcolsep}{5pt} % reduce column spacing
  \renewcommand{\arraystretch}{1.0} % reduce row spacing
  \begin{tabular}{cc}
    \toprule
    \textbf{Subset Size} & \textbf{Training Epochs} \\
    \midrule
    100\% & 7 \\
    75\%  & 9 \\
    50\%  & 14 \\
    35\%  & 20 \\
    20\%  & 35 \\
    10\%  & 70 \\
    5\%   & 140 \\
    2\%   & 350 \\
    1\%   & 700 \\
    \bottomrule
  \end{tabular}
  \caption{Training epochs scaled by subset size for all label-efficiency experiments.}
  \label{tab:subset_epochs}
\end{table}
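
The epoch counts in Table~\ref{tab:subset_epochs} are consistent with dividing a fixed budget of 7 full-data epochs by the subset fraction and rounding to the nearest integer, so that each run sees roughly the same number of training samples. The rounding rule below is our reading of the table, offered as a sketch rather than the exact scheduling code.

```python
# Assumption: a budget of 7 full-data epochs, divided by the subset fraction.
FULL_DATA_EPOCHS = 7


def epochs_for_subset(percent):
    """Epochs so that subset training sees about as many samples as full-data training."""
    return round(FULL_DATA_EPOCHS * 100 / percent)


schedule = {pct: epochs_for_subset(pct) for pct in (100, 75, 50, 35, 20, 10, 5, 2, 1)}
print(schedule)
```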

\subsection{ViT on BigEarthNetv2.0 Multi-label Classification with \texttt{TOKEN-FUSE}}

\begin{figure*}[t!]
    \centering
    \includegraphics[width=\linewidth]{figures/vit-ben-plot.png}
    \caption{\textbf{Label efficiency of a ViT trained with an auxiliary SatCLIP token.} \textbf{Left:} ViT-Base (86M trainable parameters), with the SatCLIP linear projection layer mapping to an embedding dimension of $768$. \textbf{Right:} ViT-Small (22M trainable parameters), with the SatCLIP linear projection layer mapping to an embedding dimension of $384$.}
    \label{fig:vit_ben_plots}
\end{figure*}

Our experiments with the Vision Transformer (ViT) use a ViT-Base and a ViT-Small ($86$M and $22$M trainable parameters, respectively) with a fixed patch size of $8$. All ViTs are randomly initialized with a fixed random seed. We prepend a learnable location token \(x_{\text{loc}} \in \mathbb{R}^D\) to the input sequence in addition to a class token \(x_{\text{cls}} \in \mathbb{R}^D\) and \(N\) patch tokens \(x_{\text{patch}}^{(i)} \in \mathbb{R}^D\). The token sequence is given by
\[
X_{\text{tokens}} = \Bigl[ x_{\text{cls}}; \; x_{\text{loc}}; \; x_{\text{patch}}^{(1)}, \dots, x_{\text{patch}}^{(N)} \Bigr] \in \mathbb{R}^{(N+2)\times D}.
\]
We add corresponding learnable positional embeddings 
\[
E_{\text{pos}} = \Bigl[ e_{\text{cls}}; \; e_{\text{loc}}; \; e_{\text{patch}}^{(1)}, \dots, e_{\text{patch}}^{(N)} \Bigr].
\]
Our final sequence is \(z_0 = X_{\text{tokens}} + E_{\text{pos}}\). Adding the auxiliary SatCLIP token with \texttt{TOKEN-FUSE} increases the sequence length by one and allows the model to jointly encode class and location information.
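
As a concrete check on the shapes involved, the sketch below computes the sequence length with and without the location token, together with the size of a linear projection from the 256-dimensional SatCLIP embedding to the ViT embedding dimension. The input image size of 120 pixels and the presence of a bias term are assumptions made for illustration; neither is specified in this section.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches N for a square input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def sequence_length(image_size, patch_size, use_location_token=True):
    """Length of [x_cls; x_loc; x_patch^(1..N)], i.e. N + 2 with the location token."""
    extra = 2 if use_location_token else 1
    return num_patches(image_size, patch_size) + extra


def projection_params(in_dim, embed_dim, bias=True):
    """Parameter count of a linear layer mapping in_dim -> embed_dim."""
    return in_dim * embed_dim + (embed_dim if bias else 0)


IMAGE_SIZE = 120   # assumption for illustration; not stated in this section
PATCH_SIZE = 8     # fixed patch size used in our experiments
SATCLIP_DIM = 256  # SatCLIP embedding dimension

print(sequence_length(IMAGE_SIZE, PATCH_SIZE, use_location_token=False))  # N + 1
print(sequence_length(IMAGE_SIZE, PATCH_SIZE))                            # N + 2
print(projection_params(SATCLIP_DIM, 768), projection_params(SATCLIP_DIM, 384))
```

Under these assumptions the location token adds a single position to the sequence, and the projection layer is small relative to the $86$M and $22$M parameter backbones.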
\paragraph{Why SatCLIP?} SatCLIP is currently the only location encoder in prior work that is pre-trained on Sentinel-2 satellite imagery, which makes it a suitable candidate for our experiments, as they primarily train, validate, and test on geospatial satellite imagery. Future work will evaluate the label efficiency and out-of-sample performance of SatML models trained with newer location encoders that are pre-trained on satellite or geospatial imagery.

Our experiments on the BigEarthNetv2.0 dataset use a batch size of $700$ and run for $15$ epochs ($5$ warmup epochs) at a base learning rate of $5\times10^{-4}$. We use a dropout rate of $0.15$ to mitigate overfitting across all settings (finetuned SatCLIP (\texttt{FT}), frozen SatCLIP (\texttt{F}), and register token). We record macro- and micro-averaged average precision, recall, and F1 score in addition to class-wise accuracies. \Cref{fig:vit_ben_plots} shows label-efficiency results (similar to \Cref{tab:vit_table}) of a frozen SatCLIP auxiliary token with a learnable linear projection layer on the BigEarthNetv2.0 dataset.
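
To make the distinction between macro and micro averaging concrete, the following sketch computes both variants of the F1 score for a toy multi-label problem. It is a plain-Python illustration with hypothetical labels, not the metric code used in our experiments.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """y_true, y_pred: lists of equal-length 0/1 label vectors (one per sample)."""
    num_classes = len(y_true[0])
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for t_row, p_row in zip(y_true, y_pred):
        for c, (t, p) in enumerate(zip(t_row, p_row)):
            tp[c] += int(t == 1 and p == 1)
            fp[c] += int(t == 0 and p == 1)
            fn[c] += int(t == 1 and p == 0)
    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in range(num_classes)) / num_classes
    # Micro: pool the counts over classes first, so frequent classes dominate.
    micro = f1(sum(tp), sum(fp), sum(fn))
    return macro, micro


# Toy labels: class 0 is frequent and always correct, class 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
macro, micro = macro_micro_f1(y_true, y_pred)
print(macro, micro)
```

Here the macro-averaged F1 is $0.5$ because the missed rare class contributes a score of zero, whereas the micro-averaged F1 is $8/9$ because the pooled counts are dominated by the frequent class. Reporting both is informative for class-imbalanced multi-label data.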

\subsection{U-Net on SustainBench Field Boundary Delineation with \texttt{STACK}}
Our standard U-Net setup consists of 4 downsampling blocks, a bottleneck, and corresponding upsampling blocks with skip connections. Input images are georeferenced with a pre-processed OSM raster and are stored as an HDF5 dataset with 7 total channels. We apply random-crop, horizontal-flip, and vertical-flip augmentations during training and a center crop for evaluation. The model is trained for 20 epochs with a batch size of 48 at a learning rate of \(1\times10^{-4}\). We use a learning rate scheduler that reduces the learning rate when the validation loss plateaus (factor 0.5, patience 5). We record the Dice coefficient and the IoU score.
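
For reference, the two segmentation metrics can be computed from binary masks as sketched below. The masks are toy examples, and the convention that two empty masks score $1.0$ is an assumption of this sketch rather than a statement about our evaluation code.

```python
def dice_and_iou(pred, target):
    """pred, target: equal-shape 2D lists of 0/1 values. Returns (dice, iou)."""
    inter = pred_sum = target_sum = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += p * t
            pred_sum += p
            target_sum += t
    union = pred_sum + target_sum - inter
    # Convention (an assumption): two empty masks count as a perfect match.
    dice = 2 * inter / (pred_sum + target_sum) if pred_sum + target_sum else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
print(dice, iou)
```

The two scores are monotonically related through $\text{Dice} = 2\,\text{IoU}/(1+\text{IoU})$; in the toy example above the IoU is $0.5$ and the Dice coefficient is $2/3$.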

\paragraph{Ablations with model architectures.} We conduct a broad survey of commonly used SatML model architectures for semantic segmentation tasks from published work spanning 2020 to 2025. We find that the most commonly used segmentation architectures include:
\begin{itemize}
  \item Fully Convolutional Networks (FCN) \cite{long2015fully}
  \item U-Net \cite{ronneberger2015u, hou2021cunet}
  \item SegNet \cite{badrinarayanan2017segnet, weng2020water}
  \item PSPNet \cite{zhao2017pyramid, yuan2022shift}
  \item DeepLabv3+ \cite{chen2018encoder}
  \item SegFormer \cite{xie2021segformer}
  \item MA-Net \cite{fan2020ma}
\end{itemize}


We choose four commonly used segmentation architectures from the list above and run label-efficiency experiments similar to those in \Cref{sec:results-data-efficiency}, with and without an auxiliary geographic input consisting of OSM and EU-DEM raster layers. \Cref{tab:susbench_model_ablations} shows that the performance improvements from the auxiliary geographic input hold consistently across all data subsets.

\subsection{ResNet50 on USAVars Regression with \texttt{STACK}}
Our generated USAVars dataset comprises images with 7 channels and corresponding scalar labels. A custom \texttt{HDF5Dataset} class is used to load the data. For 7-channel inputs, the first four channels (indices 0--3) are normalized to \([0,1]\) by dividing by 255, while the remaining three channels (indices 4--6) are scaled from the original categorical values returned by the OSM API to RGB space. Random cropping (to an image size of 256), horizontal flips, and vertical flips are applied during training, while a center crop is used for validation and testing. To accommodate inputs with 4 or 7 channels, the initial convolutional layer of ResNet50 is re-initialized accordingly. The final fully-connected layer is replaced with a linear layer that outputs a single value for regression. We train the model for $20$ epochs with a base learning rate of $10^{-4}$ and a batch size of $512$. All experiments are seeded for reproducibility, and results are reported over five random seeds. We record the mean squared error loss and the \(R^2\) score.
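
The two regression metrics are related through the residual sum of squares, as the short sketch below illustrates. The target and prediction values are hypothetical and serve only to show the computation.

```python
def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical scalar labels and model outputs.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(mse(y_true, y_pred), r2_score(y_true, y_pred))
```

An \(R^2\) of $1$ corresponds to a perfect fit, while a model that always predicts the label mean scores $0$; unlike the mean squared error, \(R^2\) is invariant to the scale of the labels.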

\begin{figure*}[ht!]
    \centering
    \includegraphics[width=\linewidth]{figures/satclip_embeddings_supplementary.pdf}
    \caption{\textbf{Qualitative result: frozen (\texttt{F}) vs.\ finetuned (\texttt{FT}) SatCLIP auxiliary token.} [Top-left] Cosine distance of standard SatCLIP embeddings to a fixed reference point in Austria. [Top-right] Absolute difference between the cosine distances computed with our \texttt{F} SatCLIP location encoder plus trained linear projection layer and those computed with the original SatCLIP location encoder. [Bottom] Global PCA embeddings of the \texttt{F} vs.\ \texttt{FT} SatCLIP auxiliary token with \texttt{TOKEN-FUSE}.}
    \label{fig:satclip_supplementary}
\end{figure*}

\begin{figure}[b!]
    \centering
    \includegraphics[width=\textwidth]{figures/satclip_embeddings_v8.pdf}
    \caption{\textbf{Qualitative result: frozen vs.\ finetuned SatCLIP auxiliary ViT token on the BigEarthNetv2.0 land-cover classification task.}
      PCA embeddings of the SatCLIP tokens, frozen (left) vs.\ finetuned (right), for the 10 European countries covered by the BigEarthNetv2.0 dataset.}
    \label{fig:ft_satclip}
\end{figure}

\clearpage
\section{Qualitative Result: \texttt{TOKEN-FUSE}}
To qualitatively evaluate the embeddings learned by our linear projection layer, which maps the 256-dimensional SatCLIP embeddings to the token size expected by the ViT, we measure the disagreement between this learned layer and a standard SatCLIP location encoder. With a fixed reference SatCLIP embedding in Austria ($E_{\text{Austria}}$), we calculate the cosine distance between each of 200,000 randomly sampled global SatCLIP embeddings and $E_{\text{Austria}}$. We then repeat the same procedure after passing the standard SatCLIP embeddings through the learned linear projection layer, and take the disagreement to be the absolute difference between the two sets of cosine distances. \Cref{fig:satclip_supplementary} [top-right] shows that our learned linear projection layer maps SatCLIP embeddings to the SatCLIP auxiliary token without significant disagreement from the original embeddings. \Cref{fig:satclip_supplementary} [bottom] also shows a PCA visualization of the frozen (\texttt{F}) vs.\ finetuned (\texttt{FT}) SatCLIP auxiliary token embeddings mapped to RGB space. \Cref{fig:ft_satclip} shows these PCA embeddings for the countries covered by the train split of the BigEarthNetv2.0 dataset \citep{bigearthnetv2}. These results support our observation in \Cref{fig:ft_satclip} that a finetuned SatCLIP token with \texttt{TOKEN-FUSE} learns high-resolution, arbitrary information compared to a frozen token.
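
The disagreement measure described above can be sketched as follows. This is a toy illustration under stated assumptions: the 2-dimensional vectors stand in for the 256-dimensional SatCLIP embeddings, the projection is taken to be bias-free, and the two example weight matrices are hypothetical rather than learned.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)


def project(weight, x):
    """Apply a bias-free linear map; weight is a list of rows."""
    return [sum(w * a for w, a in zip(row, x)) for row in weight]


def disagreement(weight, embeddings, reference):
    """|d(e, e_ref) - d(We, We_ref)| for each embedding e."""
    ref_proj = project(weight, reference)
    out = []
    for e in embeddings:
        d_orig = cosine_distance(e, reference)
        d_proj = cosine_distance(project(weight, e), ref_proj)
        out.append(abs(d_orig - d_proj))
    return out


# Toy 2-D stand-ins for the SatCLIP embeddings (hypothetical values).
reference = [1.0, 0.0]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

rotation = [[0.0, -1.0], [1.0, 0.0]]  # orthogonal map: angles preserved
stretch = [[3.0, 0.0], [0.0, 1.0]]    # anisotropic map: angles distorted

rot_dis = disagreement(rotation, embeddings, reference)
str_dis = disagreement(stretch, embeddings, reference)
print(rot_dis)
print(str_dis)
```

A projection that preserves angles, such as the rotation, yields zero disagreement, whereas the anisotropic stretch distorts the cosine distance of the third embedding. A low disagreement map therefore indicates that the learned projection retains the relative geometry of the original embeddings.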

\begin{figure}[ht!]
    \centering
    \includegraphics[width=0.55\linewidth]{figures/osm_usavars_small.pdf}
    \caption{\textbf{NAIP imagery from USAVars \citep{mosaiks} georeferenced with our OpenStreetMap (OSM) raster geographic data layer.} OSM products are smoothed with a Gaussian kernel and pre-processed to RGB space.}
    \label{fig:sample_osm}
\end{figure}
\input{sec/susbench_model_ablations}

