\begin{figure*}[t!]
 \centering
 \includegraphics[width=\linewidth]{figures/Input_Fusion_Schematic.pdf}
 \caption{\textbf{Geographic data input fusion mechanisms used in this work:} \texttt{STACK} concatenates one or more geographic raster inputs with the optical input before they are passed jointly as input to a convolution-based architecture. \texttt{PROC-STACK} first passes the geographic input through a function $f(\cdot)$ and then stacks the result with the optical input. \texttt{TOKEN-FUSE} passes a latitude-longitude pair to a location encoder $g(\cdot)$ and uses the resulting location embedding as an auxiliary token for a Vision Transformer (ViT).
 Experiments in \Cref{sec:results-data-efficiency} and \Cref{sec:results-OOD} use frozen models for $f$ and $g$; ablation experiments in \Cref{sec:ablations} use trainable models.
 }
 \label{fig:schematic}
\end{figure*}   