\section{Introduction}
\label{sec:intro}

Please follow the steps outlined below when submitting your manuscript to the IEEE Computer Society Press.
This style guide now has several important modifications (for example, you are no longer warned against the use of sticky tape to attach your artwork to the paper), so all authors should read this new version.

\subsection{Language}

All manuscripts must be in English.

\subsection{Dual submission}

Please refer to the author guidelines on the CVPR\ 2023\ web page for a
discussion of the policy on dual submissions.

\subsection{Paper length}
Papers, excluding the references section, must be no longer than eight pages in length.
The references section will not be included in the page count, and there is no limit on the length of the references section.
For example, a paper of eight pages with two pages of references would have a total length of 10 pages.
{\bf There will be no extra page charges for CVPR\ 2023.}

Overlength papers will simply not be reviewed.
This includes papers where the margins and formatting are deemed to have been significantly altered from those laid down by this style guide.
Note that this \LaTeX\ guide already sets figure captions and references in a smaller font.
The reason such papers will not be reviewed is that there is no provision for supervised revisions of manuscripts.
The reviewing process cannot determine the suitability of the paper for presentation in eight pages if it is reviewed in eleven.

\subsection{The ruler}
The \LaTeX\ style defines a printed ruler which should be present in the version submitted for review.
The ruler is provided in order that reviewers may comment on particular lines in the paper without circumlocution.
If you are preparing a document using a non-\LaTeX\ document preparation system, please arrange for an equivalent ruler to appear on the final output pages.
The presence or absence of the ruler should not change the appearance of any other content on the page.
The camera-ready copy should not contain a ruler.
(\LaTeX\ users may use options of cvpr.sty to switch between different versions.)
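As a hedged illustration of that switch, and assuming the option names shipped with the CVPR 2023 release of \texttt{cvpr.sty} (check your own copy of the style file), the preamble would load the package in one of the following ways:
{\small\begin{verbatim}
  % review: ruler, paper ID, page numbers
  \usepackage[review]{cvpr}
  % camera-ready: no ruler, no page numbers
  % \usepackage{cvpr}
  % camera-ready layout, page numbers kept
  % \usepackage[pagenumbers]{cvpr}
\end{verbatim}}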

Reviewers:
note that the ruler measurements do not align well with lines in the paper --- this turns out to be very difficult to do well when the paper contains many figures and equations, and, when done, looks ugly.
Just use fractional references (\eg, this line is $087.5$), although in most cases one would expect that the approximate location will be adequate.


\subsection{Paper ID}
Make sure that the Paper ID from the submission system is visible in the version submitted for review (replacing the ``*****'' you see in this document).
If you are using the \LaTeX\ template, \textbf{make sure to update paper ID in the appropriate place in the tex file}.


\subsection{Mathematics}

Please number all of your sections and displayed equations as in these examples:
\begin{equation}
  E = m\cdot c^2
  \label{eq:important}
\end{equation}
and
\begin{equation}
  v = a\cdot t.
  \label{eq:also-important}
\end{equation}
It is important for readers to be able to refer to any particular equation.
Just because you did not refer to it in the text does not mean some future reader might not need to refer to it.
It is cumbersome to have to use circumlocutions like ``the equation second from the top of page 3 column 1''.
(Note that the ruler will not be present in the final copy, so is not an alternative to equation numbers).
All authors will benefit from reading Mermin's description of how to write mathematics:
\url{http://www.pamitc.org/documents/mermin.pdf}.

\subsection{Blind review}

Many authors misunderstand the concept of anonymizing for blind review.
Blind review does not mean that one must remove citations to one's own work---in fact it is often impossible to review a paper unless the previous citations are known and available.

Blind review means that you do not use the words ``my'' or ``our'' when citing previous work.
That is all.
(But see below for tech reports.)

Saying ``this builds on the work of Lucy Smith [1]'' does not say that you are Lucy Smith;
it says that you are building on her work.
If you are Smith and Jones, do not say ``as we show in [7]'', say ``as Smith and Jones show in [7]'' and at the end of the paper, include reference 7 as you would any other cited work.

An example of a bad paper just asking to be rejected:
\begin{quote}
\begin{center}
    An analysis of the frobnicatable foo filter.
\end{center}

   In this paper we present a performance analysis of our previous paper [1], and show it to be inferior to all previously known methods.
   Why the previous paper was accepted without this analysis is beyond me.

   [1] Removed for blind review
\end{quote}


An example of an acceptable paper:
\begin{quote}
\begin{center}
     An analysis of the frobnicatable foo filter.
\end{center}

   In this paper we present a performance analysis of the  paper of Smith \etal [1], and show it to be inferior to all previously known methods.
   Why the previous paper was accepted without this analysis is beyond me.

   [1] Smith, L and Jones, C. ``The frobnicatable foo filter, a fundamental contribution to human knowledge''. Nature 381(12), 1-213.
\end{quote}

If you are making a submission to another conference at the same time, which covers similar or overlapping material, you may need to refer to that submission in order to explain the differences, just as you would if you had previously published related work.
In such cases, include the anonymized parallel submission~\cite{Authors14} as supplemental material and cite it as
\begin{quote}
[1] Authors. ``The frobnicatable foo filter'', F\&G 2014 Submission ID 324, Supplied as supplemental material {\tt fg324.pdf}.
\end{quote}

Finally, you may feel you need to tell the reader that more details can be found elsewhere, and refer them to a technical report.
For conference submissions, the paper must stand on its own, and not {\em require} the reviewer to go to a tech report for further details.
Thus, you may say in the body of the paper ``further details may be found in~\cite{Authors14b}''.
Then submit the tech report as supplemental material.
Again, you may not assume the reviewers will read this material.

Sometimes your paper is about a problem which you tested using a tool that is widely known to be restricted to a single institution.
For example, let's say it's 1969, you have solved a key problem on the Apollo lander, and you believe that the CVPR70 audience would like to hear about your
solution.
The work is a development of your celebrated 1968 paper entitled ``Zero-g frobnication: How being the only people in the world with access to the Apollo lander source code makes us a wow at parties'', by Zeus \etal.

You can handle this paper like any other.
Do not write ``We show how to improve our previous work [Anonymous, 1968].
This time we tested the algorithm on a lunar lander [name of lander removed for blind review]''.
That would be silly, and would immediately identify the authors.
Instead write the following:
\begin{quotation}
\noindent
   We describe a system for zero-g frobnication.
   This system is new because it handles the following cases:
   A, B.  Previous systems [Zeus et al. 1968] did not  handle case B properly.
   Ours handles it by including a foo term in the bar integral.

   ...

   The proposed system was integrated with the Apollo lunar lander, and went all the way to the moon, don't you know.
   It displayed the following behaviours, which show how well we solved cases A and B: ...
\end{quotation}
As you can see, the above text follows standard scientific convention, reads better than the first version, and does not explicitly name you as the authors.
A reviewer might think it likely that the new paper was written by Zeus \etal, but cannot make any decision based on that guess.
He or she would have to be sure that no other authors could have been contracted to solve problem B.
\medskip

\noindent
FAQ\medskip\\
{\bf Q:} Are acknowledgements OK?\\
{\bf A:} No.  Leave them for the final copy.\medskip\\
{\bf Q:} How do I cite my results reported in open challenges?\\
{\bf A:} To conform with the double-blind review policy, you can report results of other challenge participants together with your results in your paper.
For your results, however, you should not identify yourself and should not mention your participation in the challenge.
Instead present your results referring to the method proposed in your paper and draw conclusions based on the experimental comparison to other results.\medskip\\

\begin{figure}[t]
  \centering
  \fbox{\rule{0pt}{2in} \rule{0.9\linewidth}{0pt}}
  

   \caption{Example of caption.
   It is set in Roman so that mathematics (always set in Roman: $B \sin A = A \sin B$) may be included without an ugly clash.}
   \label{fig:onecol}
\end{figure}

\subsection{Miscellaneous}

\noindent
Compare the following:\\
\begin{tabular}{ll}
 \verb'$conf_a$' &  $conf_a$ \\
 \verb'$\mathit{conf}_a$' & $\mathit{conf}_a$
\end{tabular}\\
See The \TeX book, p165.

The space after \eg, meaning ``for example'', should not be a sentence-ending space.
So \eg is correct, {\em e.g.} is not.
The provided \verb'\eg' macro takes care of this.

When citing a multi-author paper, you may save space by using ``et alia'', shortened to ``\etal'' (not ``{\em et.\ al.}'' as ``{\em et}'' is a complete word).
If you use the \verb'\etal' macro provided, then you need not worry about double periods when used at the end of a sentence as in Alpher \etal.
However, use it only when there are three or more authors.
Thus, the following is correct:
   ``Frobnication has been trendy lately.
   It was introduced by Alpher~\cite{Alpher02}, and subsequently developed by
   Alpher and Fotheringham-Smythe~\cite{Alpher03}, and Alpher \etal~\cite{Alpher04}.''

This is incorrect: ``... subsequently developed by Alpher \etal~\cite{Alpher03} ...'' because reference~\cite{Alpher03} has just two authors.




\begin{figure*}
  \centering
  \begin{subfigure}{0.68\linewidth}
    \fbox{\rule{0pt}{2in} \rule{.9\linewidth}{0pt}}
    \caption{An example of a subfigure.}
    \label{fig:short-a}
  \end{subfigure}
  \hfill
  \begin{subfigure}{0.28\linewidth}
    \fbox{\rule{0pt}{2in} \rule{.9\linewidth}{0pt}}
    \caption{Another example of a subfigure.}
    \label{fig:short-b}
  \end{subfigure}
  \caption{Example of a short caption, which should be centered.}
  \label{fig:short}
\end{figure*}

\section{Formatting your paper}
\label{sec:formatting}

All text must be in a two-column format.
The total allowable size of the text area is $6\frac78$ inches (17.46 cm) wide by $8\frac78$ inches (22.54 cm) high.
Columns are to be $3\frac14$ inches (8.25 cm) wide, with a $\frac{5}{16}$ inch (0.8 cm) space between them.
The main title (on the first page) should begin 1 inch (2.54 cm) from the top edge of the page.
The second and following pages should begin 1 inch (2.54 cm) from the top edge.
On all pages, the bottom margin should be $1\frac{1}{8}$ inches (2.86 cm) from the bottom edge of the page for $8.5 \times 11$-inch paper;
for A4 paper, approximately $1\frac{5}{8}$ inches (4.13 cm) from the bottom edge of the
page.

\subsection{Margins and page numbering}

All printed material, including text, illustrations, and charts, must be kept
within a print area $6\frac{7}{8}$ inches (17.46 cm) wide by $8\frac{7}{8}$ inches (22.54 cm)
high.
Page numbers should be in the footer, centered and $\frac{3}{4}$ inches from the bottom of the page.
The review version should have page numbers, yet the final version submitted as camera ready should not show any page numbers.
The \LaTeX\ template takes care of this when used properly.



\subsection{Type style and fonts}

Wherever Times is specified, Times Roman may also be used.
If neither is available on your word processor, please use the font closest in
appearance to Times to which you have access.

MAIN TITLE.
Center the title $1\frac{3}{8}$ inches (3.49 cm) from the top edge of the first page.
The title should be in Times 14-point, boldface type.
Capitalize the first letter of nouns, pronouns, verbs, adjectives, and adverbs;
do not capitalize articles, coordinate conjunctions, or prepositions (unless the title begins with such a word).
Leave two blank lines after the title.

AUTHOR NAME(s) and AFFILIATION(s) are to be centered beneath the title
and printed in Times 12-point, non-boldface type.
This information is to be followed by two blank lines.

The ABSTRACT and MAIN TEXT are to be in a two-column format.

MAIN TEXT.
Type main text in 10-point Times, single-spaced.
Do NOT use double-spacing.
All paragraphs should be indented 1 pica (approx.~$\frac{1}{6}$ inch or 0.422 cm).
Make sure your text is fully justified---that is, flush left and flush right.
Please do not place any additional blank lines between paragraphs.

Figure and table captions should be 9-point Roman type as in \cref{fig:onecol,fig:short}.
Short captions should be centered.

\noindent Callouts should be 9-point Helvetica, non-boldface type.
Initially capitalize only the first word of section titles and first-, second-, and third-order headings.

FIRST-ORDER HEADINGS.
(For example, {\large \bf 1. Introduction}) should be Times 12-point boldface, initially capitalized, flush left, with one blank line before, and one blank line after.

SECOND-ORDER HEADINGS.
(For example, { \bf 1.1. Database elements}) should be Times 11-point boldface, initially capitalized, flush left, with one blank line before, and one after.
If you require a third-order heading (we discourage it), use 10-point Times, boldface, initially capitalized, flush left, preceded by one blank line, followed by a period and your text on the same line.

\subsection{Footnotes}

Please use footnotes\footnote{This is what a footnote looks like.
It often distracts the reader from the main flow of the argument.} sparingly.
Indeed, try to avoid footnotes altogether and include necessary peripheral observations in the text (within parentheses, if you prefer, as in this sentence).
If you wish to use a footnote, place it at the bottom of the column on the page on which it is referenced.
Use Times 8-point type, single-spaced.


\subsection{Cross-references}

For the benefit of author(s) and readers, please use the
{\small\begin{verbatim}
  \cref{...}
\end{verbatim}}  command for cross-referencing to figures, tables, equations, or sections.
This will automatically insert the appropriate label alongside the cross-reference as in this example:
\begin{quotation}
  To see how our method outperforms previous work, please see \cref{fig:onecol} and \cref{tab:example}.
  It is also possible to refer to multiple targets at once, \eg~to \cref{fig:onecol,fig:short-a}.
  You may also return to \cref{sec:formatting} or look at \cref{eq:also-important}.
\end{quotation}
If you do not wish to abbreviate the label, for example at the beginning of the sentence, you can use the
{\small\begin{verbatim}
  \Cref{...}
\end{verbatim}}
command. Here is an example:
\begin{quotation}
  \Cref{fig:onecol} is also quite important.
\end{quotation}

\subsection{References}

List and number all bibliographical references in 9-point Times, single-spaced, at the end of your paper.
When referenced in the text, enclose the citation number in square brackets, for
example~\cite{Authors14}.
Where appropriate, include page numbers and the name(s) of editors of referenced books.
When you cite multiple papers at once, please make sure that you cite them in numerical order like this \cite{Alpher02,Alpher03,Alpher05,Authors14b,Authors14}.
If you use the template as advised, this will be taken care of automatically.

\begin{table}
  \centering
  \begin{tabular}{@{}lc@{}}
    \toprule
    Method & Frobnability \\
    \midrule
    Theirs & Frumpy \\
    Yours & Frobbly \\
    Ours & Makes one's heart Frob\\
    \bottomrule
  \end{tabular}
  \caption{Results.   Ours is better.}
  \label{tab:example}
\end{table}

\subsection{Illustrations, graphs, and photographs}

All graphics should be centered.
In \LaTeX, avoid using the \texttt{center} environment for this purpose, as this adds potentially unwanted whitespace.
Instead use
{\small\begin{verbatim}
  \centering
\end{verbatim}}
at the beginning of your figure.
Please ensure that any point you wish to make is resolvable in a printed copy of the paper.
Resize fonts in figures to match the font in the body text, and choose line widths that render effectively in print.
Readers (and reviewers), even of an electronic copy, may choose to print your paper in order to read it.
You cannot insist that they do otherwise, and therefore must not assume that they can zoom in to see tiny details on a graphic.

When placing figures in \LaTeX, it's almost always best to use \verb+\includegraphics+, and to specify the figure width as a multiple of the line width as in the example below
{\small\begin{verbatim}
   \usepackage{graphicx} ...
   \includegraphics[width=0.8\linewidth]
                   {myfile.pdf}
\end{verbatim}
}


\subsection{Color}

Please refer to the author guidelines on the CVPR\ 2023\ web page for a discussion of the use of color in your document.

If you use color in your plots, please keep in mind that a significant subset of reviewers and readers may have a color vision deficiency; red-green blindness is the most frequent kind.
Hence avoid relying only on color as the discriminative feature in plots (such as red \vs green lines), but add a second discriminative feature to ease disambiguation.

\section{Final copy}

You must include your signed IEEE copyright release form when you submit your finished paper.
We MUST have this form before your paper can be published in the proceedings.

Please direct any questions to the production editor in charge of these proceedings at the IEEE Computer Society Press:
\url{https://www.computer.org/about/contact}.


{\small
\bibliographystyle{ieee_fullname}


\section{Introduction} \label{sec:intro}





\begin{figure}
    \centering
    \includegraphics[width=0.9\columnwidth]{figures/spider_all.png}
    \caption[]
    {Our sustainable geospatial foundation model (GFM) achieves strong performance on a broad set of tasks in comparison to other state-of-the-art geospatial pretraining methods (SeCo \cite{seco}, SatMAE \cite{satmae}) and ImageNet supervised pretraining baselines.} 
    \label{fig:model_comparison}
\end{figure}

With the rise of large-scale satellite and aerial imagery~\cite{landsat,naip}, geospatial technologies are becoming increasingly important. Progress in this domain can substantially improve our ability to understand the earth and how we interact with it.
In the current era of deep learning, data fuels progress. Thankfully, the amount of data and tasks in the geospatial domain continues to grow. This has recently been made evident in a survey \cite{earthnets} compiling hundreds of published datasets for earth observation. Beyond curated datasets, openly available satellite imagery programs like Sentinel~\cite{sentinel}, Landsat~\cite{landsat}, and the National Agriculture Imagery Program (NAIP)~\cite{naip} also provide a plethora of data for use.

With such a vast interest in the application of geospatial and remote sensing data,
the computer vision community has been continually investing in designing better algorithms to harness the available data and improve performance on various tasks.
Particularly, with the rising popularity of foundation models in vision and natural language, many works have focused on building strong pretrained models specific to the geospatial domain \cite{seco, gassl, satmae, millionAID_supervised_pretraining}.
These methods typically train a network from scratch on a large corpus of remote sensing imagery. Unfortunately, this can require a significant amount of data and training time to achieve good performance, especially when employing large state-of-the-art transformer models.

However, in this pursuit of a stronger geospatial foundation model, one potentially useful tool has been largely forgotten. ImageNet pretrained models are readily available for the majority of state-of-the-art architectures. More recently, we now have many models trained on the larger scale ImageNet-22k~\cite{imagenet} dataset, providing even stronger and more general representations than before.
Rather than beginning the pretraining process tabula rasa, could these ImageNet representations serve as a base on which stronger geospatial models can be built?
In the same spirit, continual pretraining has been practiced in the NLP domain with success in various works \cite{dontStopPretraining, continual_temporal, continual_mixedLang}. In this paradigm, existing foundation models are further improved for a specific domain or task through a secondary pretraining stage. In principle, we reason that such a paradigm has the potential to produce strong geospatial models in an efficient and sustainable manner.

To this end, we investigate a sustainable approach for building geospatial foundation models. Specifically, we form a multi-objective continual pretraining paradigm, simultaneously leveraging ImageNet pretrained features and self-supervised learning on a concise collection of geospatial imagery. In our investigations, we discover two important factors in the process.

\begin{itemize}
    \item \textbf{Data choice matters.} We find that the selection of pretraining data matters, even within the geospatial domain. Therefore, we select a diverse collection of data from various sources to capture a wider variety of general remote sensing scenes, which we term GeoPile. Conducting masked image modeling with GeoPile is significantly more effective and sample efficient compared to other common alternatives (see Section \ref{sec:data}).
    \item \textbf{Continual pretraining.} Available pretrained models on diverse datasets like ImageNet-22k should not be ignored when building geospatial foundation models. Rather, by leveraging their representations, we can build strong models for geospatial applications in a sustainable manner. To this end, we investigate a multi-objective continual pretraining paradigm for simple and effective learning (see Section \ref{sec:gfm}). By continual pretraining, we show that our newly proposed Geospatial Foundation Model (GFM) outperforms previous state-of-the-art geospatial pretrained models on a broad set of tasks, as shown in Figure~\ref{fig:model_comparison}.
\end{itemize}


\section{Related Work} \label{sec:related}
\subsection{Masked Image Modeling}
Masked image modeling (MIM) has been proposed in various forms in recent years, and has quickly become very popular as an effective pretraining task.
In general, the goal is to learn from data in a self-supervised manner by asking the model to generate pixel values for intentionally-withheld regions in an image.
\cite{context_encoders} is an early work with an aim of learning strong visual representations through inpainting masked regions. In \cite{generative_pretrain}, Chen \etal train a large transformer to predict pixels autoregressively. After the introduction of vision transformers (ViT) \cite{vit}, many works continued to improve various MIM variants. \cite{beit} and \cite{ibot} take inspiration from BERT \cite{bert} in natural language processing, and tokenize the image patches with either a pretrained model or a jointly trained online tokenizer, with the objective being to reconstruct tokens rather than raw pixels. Recently, \cite{simmim} and \cite{mae} show that a masked image modeling task of simply regressing directly on the image pixels is sufficient and effective. In this work, we leverage the framework from \cite{simmim}, as it is compatible with hierarchical transformer architectures \cite{swin}.
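
As a sketch of this pixel-regression objective (a generic form assumed here for illustration, not the exact loss of \cite{simmim} or \cite{mae}), let $\mathcal{M}$ denote the set of masked pixel positions, $x_i$ the original value at position $i$, and $\hat{x}_i$ the value predicted by the network from the visible regions; these symbols are introduced only for this illustration:
\[
  \mathcal{L}_{\text{MIM}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \ell\left(\hat{x}_i, x_i\right),
\]
where $\ell$ is a per-pixel regression penalty such as the absolute or squared error. Only masked positions contribute to the sum, so the model is not rewarded for copying visible pixels.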

\subsection{Geospatial Pretraining}
Various works have experimented with employing supervised or self-supervised pretraining paradigms in the geospatial domain. The classical work of \cite{indomain}, and the more recent paper \cite{millionAID_supervised_pretraining}, investigate supervised pretraining on individual datasets of various sizes. Interestingly, these works still often find ImageNet pretrained models to perform very well, particularly with vision transformers \cite{vit, swin}.
Other works have explored self-supervised learning paradigms for remote sensing, primarily focused on contrastive methods. \cite{seco} and \cite{gassl} employ a MoCo \cite{mocov2} style objective using spatially aligned but temporally different images as the positive pairs. \cite{saumoco} and \cite{tile2vec} also utilize a MoCo-inspired objective, but specify a cropping procedure to generate positives and negatives within and across images. \cite{colorOutofSpace} employs a colorization objective on Sentinel-2 imagery utilizing the various spectral bands. Most recently, SatMAE \cite{satmae} explores the use of masked image modeling to train a large ViT model. This work is similar in some respects to ours, as we also train a transformer model with an MIM objective. However, we find that SatMAE often does not perform better than the off-the-shelf ImageNet-22k pretrained ViT (Section \ref{sec:experiments}). This both indicates the difficulty of building strong geospatial foundation models from scratch and highlights the potential usefulness of leveraging continual pretraining instead, as we investigate in this work.




In this work, we develop our pretraining objective based on a masked image modeling approach like \cite{simmim, mae}.
MIM has recently been shown to be particularly effective in the natural image domain, surpassing many contrastive works and being shown to be friendlier to downstream optimization \cite{simmim, mae, ibot, beit, dark_secrets}. Exploration of the masked image modeling framework in geospatial applications is still in its early stages, and could help alleviate some concerns with contrastive approaches in this domain.
Particularly, the choice of augmentations with contrastive methods can be quite difficult, as common selections such as greyscale, color jitter and others that heavily affect the intensity of the image can instill undesirable invariances \cite{indomain}. On the other hand, MIM objectives like \cite{simmim, mae} rely only on simple spatial augmentations such as flipping and cropping. Furthermore, a common remote sensing application is that of change detection, which requires a model to detect changes in two images from the same location but at different times. In order to still be effective on this task, works that use contrastive approaches on temporal positives introduce various design choices. For instance, SeCo \cite{seco} creates multiple feature subspaces during pretraining, each one invariant to a separate form of augmentation. \cite{matter} also employs temporal positives, but instead chooses the sampling locations for the pretraining data to ensure that image pairs contain primarily natural illumination and viewing-angle variations, without major changes such as new urban developments.
\begin{figure*}
    \centering
    \includegraphics[trim={160 160 160 160},clip, width=0.9\textwidth]{figures/samples.pdf}
    \caption[]
    {We visualize some example images from the pretraining datasets. From left to right: ImageNet, Sentinel-2, and GeoPile. Sentinel-2 has noticeably lower feature diversity within a single image and across images than ImageNet or our GeoPile pretraining dataset.}
    \label{fig:data_comparison_visual}
\end{figure*}

\subsection{Continual Pretraining}
Continual pretraining has been primarily introduced in the natural language domain \cite{dontStopPretraining, continual_temporal, continual_mixedLang}, in order to improve large language models (LLMs). \cite{dontStopPretraining} illustrates the viability of two additional stages of pretraining, using in-domain data (domain-adaptive), and then even further using task-specific data (task-adaptive). \cite{continual_temporal} proposes a continual training paradigm for enabling temporal reasoning abilities in pretrained language models. \cite{continual_mixedLang} focuses on using continual pretraining to enable mixed language neural machine translation. In the vision domain, \cite{medseg} employs a BYOL \cite{byol} style continual pretraining paradigm for 2D medical image segmentation. \cite{selfimproveself} explores a hierarchical pretraining approach for task adaptation. However, they primarily focus on adapting to a specific downstream task at a time, employing three training stages on top of an existing pretrained model for each task individually. In contrast, we employ one efficient in-domain pretraining setting that can generalize to many downstream tasks, as illustrated in Section \ref{sec:experiments}. Furthermore, rather than directly loading the pretrained weights from existing models as initialization, we find instead that leveraging the representations as an auxiliary distillation objective during the pretraining process enables learning strong in-domain representations in a sustainable manner.

\section{Methodology}
We aim to investigate a sustainable approach for building geospatial foundation models. This leads us to two key insights. First, the selection of pretraining data matters, even within the geospatial domain. We discuss our empirical findings to this end in Section \ref{sec:data}. Second, available pretrained models on diverse datasets like ImageNet-22k should not be ignored when building geospatial foundation models. In fact, by leveraging their representations, we can build strong models for geospatial applications in a sustainable manner. These discussions can be found in Section \ref{sec:gfm}.
\subsection{Pre-training Data Matters} \label{sec:data}

A particularly common choice of source data among geospatial contrastive pretraining works is Sentinel-2 imagery \cite{seco, matter, colorOutofSpace} due to its large corpus of available data and ease of access.
Therefore, to begin our study, we first gather a pretraining dataset of $\sim$1.3 million Sentinel-2 images (matching the scale of ImageNet-1k~\cite{imagenet}) using the sampling technique from \cite{seco}.
After gathering the Sentinel-2 data, we employ it to pretrain a Swin-B \cite{swin} model with the masked image modeling (MIM) objective from \cite{simmim}. 
We then finetune and evaluate this model on a wide variety of downstream datasets to get a broad understanding of its performance potential in many tasks (see Section \ref{sec:experiments} for task details). For a comparison, we finetune the ImageNet-22k pretrained Swin-B from the official Swin Transformer repository \cite{swin} on all downstream tasks as a baseline. In order to compare these models across all tasks, we introduce an average relative performance metric (ARP) in which we take the relative percentage difference on each task with respect to the ImageNet-22k baseline, and then average that difference:
\begin{equation} \label{eg:arp}
    \text{ARP}(M) = \frac{1}{N}\sum_{i=1}^N \frac{\text{score}(M, \text{task}_i) -\text{score}( \text{baseline}, \text{task}_i)}{\text{score}( \text{baseline}, \text{task}_i)}.
\end{equation}
\noindent Here ``baseline'' is the Swin-B model pretrained on ImageNet-22k and finetuned on ImageNet-1k, as mentioned above. $M$ denotes the model for performance evaluation, and $N$ is the number of tasks. There are $7$ tasks used in Section~\ref{sec:experiments} covering important geospatial tasks such as classification, multi-label classification, semantic segmentation, change detection, and super-resolution. The reported ARP value is scaled by 100 to show as a percentage.
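
As a worked illustration of the ARP definition above, consider a purely hypothetical setting with $N = 2$ tasks in which the baseline scores $80.0$ and $50.0$ while a model $M$ scores $84.0$ and $49.0$ (these numbers are invented for the example and are not results from this paper):
\[
  \text{ARP}(M) = \frac{1}{2}\left(\frac{84.0 - 80.0}{80.0} + \frac{49.0 - 50.0}{50.0}\right) = \frac{1}{2}\left(0.05 - 0.02\right) = 0.015,
\]
which would be reported as $1.5\%$ after scaling by 100. Because each difference is normalized by the baseline score on the same task, tasks whose metrics live on different ranges contribute on a comparable footing, and a positive ARP indicates an average improvement over the baseline.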

We compare these two models in Table \ref{tab:data}.
Interestingly, we find that the Sentinel-2 model performs poorly on downstream tasks compared to the ImageNet-22k baseline.
To investigate this further, we also pretrain a model using MIM on ImageNet-1k,
and find this actually performs better than using Sentinel-2 imagery.
While there is obviously a degree of domain shift between ImageNet and remote sensing data, we reason that Sentinel-2 data alone lacks sufficient feature diversity for a strong pretraining dataset. As a basic indicator, we calculate the average image entropy over a randomly sampled set of 3000 images for both ImageNet-1k and our collected Sentinel-2 data, and find it to be 5.1 and 3.9, respectively. Although entropy is certainly not the sole factor at play, such an evaluation can still provide insight into the advantages of ImageNet over Sentinel-2. For MIM objectives, training data with substantially lower entropy can make for an easier reconstruction task, since masked regions may be more similar to their neighbors. The network therefore does not have to work as hard to fill in the blanks, limiting the learning potential. Qualitatively, we also visualize multiple samples from ImageNet-1k and Sentinel-2 in the top row of Figure \ref{fig:data_comparison_visual}. The feature diversity within a single image and across images of Sentinel-2 is perceptibly lower than that of ImageNet. This result indicates that a comparatively narrow scope of features is provided to the model when pretraining with Sentinel-2.
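
For reference, a common definition of image entropy, assumed here for illustration rather than as a statement of the exact procedure used, is the Shannon entropy of the normalized intensity histogram of an image:
\[
    H = -\sum_{k=0}^{K-1} p_k \log_2 p_k,
\]
where $p_k$ is the fraction of pixels falling in intensity bin $k$ and $K$ is the number of bins (for example, $K=256$ for 8-bit intensities, in which case $0 \le H \le 8$ bits). Under this definition, an image of constant intensity has $H=0$, while an image whose intensities are spread uniformly over all bins attains the maximum $H=\log_2 K$.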



Therefore, we set out to collect a diverse geospatial pretraining dataset. Sourcing from both labeled and unlabeled data, we form a new pretraining dataset which we term GeoPile. The breakdown of GeoPile is shown in Table \ref{tab:geopile}. For textural detail, we ensure a variety of ground sample distances (GSD), including imagery with much higher resolution than Sentinel-2 (which has a GSD of 10m). Furthermore, the selected labeled datasets encompass a wide variety of classes from general remote sensing scenes, ensuring visual diversity across samples. We calculate the average entropy of our GeoPile dataset and find it to be 4.6, much closer to that of ImageNet-1k. The textural and visual diversity is also qualitatively evident in Figure \ref{fig:data_comparison_visual}. As shown in Table \ref{tab:data}, the benefit of this data selection is evident in the substantial performance increase.

To further improve the performance of our pretrained model in comparison to the ImageNet-22k baseline, we increase the number of training epochs in the last rows of Table \ref{tab:data}. While we are able to make improvements, this comes at the cost of substantially more compute and a larger carbon footprint for marginal gain. Therefore, we ask the question: can we significantly improve performance with minimal compute and carbon footprint overhead? To this end, we investigate a simple and sustainable approach for building geospatial foundation models with strong performance.









\begin{table}
    \caption{Dataset analysis. To evaluate each method, we finetune the pretrained model on seven different tasks, outlined in Section \ref{sec:experiments}, and report the ARP metric defined in Equation \ref{eg:arp}.
    Overall, our collected GeoPile pretraining dataset significantly improves downstream performance. To further improve the performance in a sustainable manner, we introduce our continual pretraining paradigm, GFM. We show the ARP and CO2 estimations \cite{co2} for GFM trained on GeoPile.}
    \label{tab:data}
    \centering
    \setlength\tabcolsep{4.0pt} 
    \renewcommand{\arraystretch}{0.9}
   
    \begin{tabular}{ccccc}
        \toprule
        Method & \# Images & Epochs & ARP $\uparrow$ & CO2 $\downarrow$\\
        \toprule
        ImageNet-22k Sup. & 14M & - & 0.0 & -\\
       
       
        \midrule
        ImageNet-1k & 1.3M & 100 & 1.82 & 17.76\\
        Sentinel-2 \cite{seco} & 1.3M & 100 & -5.53 & 17.76\\
       
       
       
       
       
       
        GeoPile & 600k & 200 & 2.02 & 12.64\\
        GeoPile & 600k & 800 & 2.44 & 50.56\\
        \midrule
        GFM & 600k & 100 & 4.47 & 8.56\\
        \midrule
    \end{tabular}
\end{table}

\begin{table}
    \caption{Breakdown of datasets in the GeoPile. We gather approximately 600k samples from a combination of labeled and unlabeled satellite imagery.}
    \label{tab:geopile}
    \centering
    \setlength\tabcolsep{5.0pt} 
    \renewcommand{\arraystretch}{0.9}
   
    \begin{tabular}{cccc}
        \toprule
        Dataset & \# Images & GSD & \# Classes\\
        \toprule
        NAIP \cite{naip} & 300,000 & 1m & n/a\\
        RSD46-WHU \cite{RSD46-WHU} &  116,893 & 0.5m - 2m & 46\\
        MLRSNet \cite{MLRSNet} & 109,161 & 0.1m - 10m & 60\\
        RESISC45 \cite{RESISC45} & 31,500 & 0.2m - 30m & 45\\
        PatternNet \cite{PatternNet} & 30,400 & 0.1m - 0.8m & 38\\
        \midrule
    \end{tabular}
\end{table}
\subsection{Sustainable GFM} \label{sec:gfm}

\begin{figure*}
    \centering
    \includegraphics[trim={100 60 100 60},clip, width=0.75\textwidth]{figures/gfm.pdf}
    \caption[]
    {Our GFM multi-objective continual pretraining paradigm. The framework uses two parallel branches, one initialized with ImageNet-22k weights (top) and another from random initialization (bottom).
    Blue functional blocks are frozen during training, and green ones are trained. In a teacher-student fashion, we leverage the intermediate features of an ImageNet-22k pretrained model to guide and quicken learning. Furthermore, we build in an MIM objective on the student branch to allow for learning valuable in-domain features directly from the geospatial data.}
    \label{fig:gfm}
\end{figure*}

Nowadays, state-of-the-art models are commonly released with weights pretrained on very large and diverse datasets like ImageNet-22k. While not perfect for geospatial tasks, these models still contain a vast amount of useful knowledge that is generalizable across many settings.
Nonetheless, the majority of previous works in geospatial pretraining neglect the available ImageNet representations, which is not ideal, especially for large transformer models that are notoriously data hungry and computationally expensive to train.

Instead, we reason that the valuable knowledge available in these large-scale models should be leveraged to produce strong performance with minimized overhead. To this end, we propose an unsupervised multi-objective training paradigm for sustainable training of geospatial foundation models, illustrated in Figure \ref{fig:gfm}.

There are two main components in our framework. First, we randomly initialize an encoder and decoder set up for MIM as in \cite{simmim}. During training, the input is randomly masked, and the network attempts to reconstruct the image at the output. This MIM objective is enforced with an $\ell_1$ loss \cite{simmim}:
\begin{equation}
    \mathcal{L}_{MIM}=\frac{\left\|\mathbf{O}_\kappa-\mathbf{G}_\kappa\right\|_1}{N},
\end{equation}
where $\mathbf{O}_{\kappa}$ is the original pixel information from the masked regions, $\mathbf{G}_{\kappa}$ are the generated reconstructions for those masked regions, and $N$ is the total number of masked pixels.

To leverage the strong representations from the ImageNet model, we initialize a second encoder branch up to a chosen stage $L$ and load the pretrained weights. 
This branch serves as a form of teacher during the training process for the other, ``student'' branch, which becomes our final GFM model. For the ImageNet teacher, we freeze the weights, both to ensure that the structured representations are maintained during the training process and to reduce the computation required during optimization.
 
Rather than using the masked input as in the student branch, the teacher receives the unmasked image as input,
and provides a feature output $f_{L}^{T}$ at stage $L$. This feature has access to the full context of the input, enabling it to capture informative representations.
We utilize this feature to guide the representations of the student, and form a secondary objective with the cosine similarity between branch features: 
\begin{equation}
    \mathcal{L}_{feat} =  -\frac{P(f_{L}^{S})}{\left\|P(f_{L}^{S})\right\|_2} \cdot \frac{f_{L}^{T}}{\left\|f_{L}^{T}\right\|_2},
\end{equation}
where $f_{L}^{S}$ and $f_{L}^{T}$ are the intermediate features of the student and teacher branches at stage $L$, and $P$ is a linear projection layer. Therefore, the final loss during training is simply the summation of these objectives:
\begin{equation} \label{eq:loss_combined}
    \mathcal{L} = \mathcal{L}_{MIM} + \alpha\mathcal{L}_{feat},
\end{equation}
where $\alpha$ is a balancing term, which we find works best at $\alpha=1.0$ (Section \ref{sec:ablation}).
This training paradigm enables an ideal two-fold optimization. Distillation from the intermediate features of the teacher ensures that the student can benefit from the teacher's diverse knowledge, learning more in less time. Furthermore, the student is simultaneously given freedom to adapt to in-domain data through its own pretraining objective, gathering new domain-specific features to improve performance.
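
To make the scale of the two terms explicit, note that both factors in $\mathcal{L}_{feat}$ are normalized to unit $\ell_2$ norm. Treating each feature as a single vector (an assumption made here for illustration), $\mathcal{L}_{feat}$ is the negative cosine similarity between the projected student feature and the teacher feature, so that
\[
    -1 \le \mathcal{L}_{feat} \le 1, \qquad \mathcal{L}_{MIM} \ge 0,
\]
with $\mathcal{L}_{feat}=-1$ attained when $P(f_{L}^{S})$ and $f_{L}^{T}$ point in the same direction. Consequently, with $\alpha=1.0$, the combined loss in Equation \ref{eq:loss_combined} is bounded below by $-1$, and the feature term cannot grow without bound relative to the reconstruction term.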

We analyze the ARP and sustainability potential of this approach in Table \ref{tab:data}. Notably, our GFM achieves better overall performance with substantially less computation and emissions impact\footnote{CO2 estimations were completed with \url{https://mlco2.github.io/impact} from \cite{co2}.} compared to tabula rasa pretraining on the same dataset, illustrating that our multi-objective continual pretraining paradigm is a sustainable method for training these models.
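
To make this comparison concrete using the values in Table \ref{tab:data}, the estimated emissions of from-scratch MIM pretraining on GeoPile relative to GFM are
\[
    \frac{12.64}{8.56} \approx 1.48 \quad \text{and} \quad \frac{50.56}{8.56} \approx 5.91
\]
for the 200-epoch and 800-epoch runs, respectively. In other words, GFM attains a higher ARP (4.47 versus 2.02 and 2.44) at roughly $1.5\times$ to $5.9\times$ lower estimated emissions.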


 



\section{Experiments} \label{sec:experiments}
To verify the effectiveness of our model in detail, we conduct experiments on seven geospatial datasets of various tasks including change detection (Section \ref{sec:change_det}), classification (Section \ref{sec:classification}), segmentation (Section \ref{sec:seg_detect}), and super-resolution (Section \ref{sec:superres}).

\subsection{Change Detection} \label{sec:change_det}
Change detection is a particularly important remote sensing task, helping us understand how humans interact with our planet over time, as well as the natural phenomena that change our planet's landscape. We conduct experiments on both the Onera Satellite Change Detection (OSCD) dataset \cite{OSCD} in Table \ref{tab:OSCD} and the DSIFN dataset \cite{DSIFN} in Table \ref{tab:DSFIN}.


OSCD consists of 24 image pairs extracted from various regions around the world within the three-year period from 2015 to 2018. The images are taken from Sentinel-2 with GSDs ranging from 10m to 60m, and are split into 14 pairs for training and 10 for evaluation. The annotations indicate whether change has occurred at the pixel level, and focus primarily on urban developments. We also test our method on the DSIFN dataset, which contains high-resolution imagery from sources such as WorldView-3 and GeoEye-1 \cite{DSIFN}, with 3490 samples for training and 48 images for evaluation. For both datasets, each pair of images from a given location at two different timestamps is fed into the Swin encoder \cite{swin} for feature extraction. The difference between the features from each pair is computed and fed into a UPerNet \cite{Upernet} to generate the final binary segmentation masks \cite{seco, siamdiff}. The encoder is initialized with the pretrained weights.


For both datasets, we report the precision, recall, and F1 score on the ``change'' class. As the results on OSCD (Table \ref{tab:OSCD} and Figure \ref{fig:OSCD}) and DSIFN (Table \ref{tab:DSFIN}) show, GFM yields a consistent improvement over the ImageNet-22k baseline across both datasets. Notably, SatMAE is able to improve over its ImageNet-22k baseline on OSCD, but lags behind on DSIFN. This further highlights the difficulty of training large vision transformers from scratch that can perform consistently across different GSDs.
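
For reference, the F1 score is the harmonic mean of precision $P$ and recall $R$. As a worked check using the GFM row of Table \ref{tab:OSCD}:
\[
    F_1 = \frac{2PR}{P+R} = \frac{2 \times 58.07 \times 61.67}{58.07 + 61.67} \approx 59.82.
\]
Because the harmonic mean is dominated by the smaller of the two quantities, a method with high precision but low recall (or vice versa) receives a low F1 score, which is why we use F1 as the primary summary of change detection quality.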

\begin{table}
    \caption{Onera Satellite Change Detection Results}
    \label{tab:OSCD}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \resizebox{\columnwidth}{!}{
    \begin{tabular}{cccc}
        \toprule
        Method & Precision $\uparrow$ & Recall $\uparrow$ & F1 $\uparrow$\\
        \toprule
        ResNet50 (ImageNet-1k) \cite{resnet} & \textbf{70.42} & 25.12 & 36.20\\
        SeCo \cite{seco} & 65.47 & 38.06 & 46.94\\
       
        MATTER \cite{matter} & 61.80 & 57.13 & 59.37\\
        ViT (ImageNet-22k) \cite{vit} & 48.34 & 22.52 & 30.73\\
        SatMAE \cite{satmae} & 48.19 & 42.24 & 45.02\\
        Swin (random) \cite{swin} & 51.80 & 47.69 & 49.66\\
        Swin (ImageNet-22k) \cite{swin} & 46.88 & 59.28 & 52.35\\
        \midrule
        GFM & 58.07 & \textbf{61.67} & \textbf{59.82}\\
        \midrule
    \end{tabular}
    }
\end{table}

\begin{table}
    \caption{DSIFN Change Detection Results}
    \label{tab:DSFIN}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \resizebox{\columnwidth}{!}{
    \begin{tabular}{cccc}
        \toprule
        Method & Precision $\uparrow$ & Recall $\uparrow$ & F1 $\uparrow$\\
        \toprule
        ResNet50 (ImageNet-1k) \cite{resnet} & 28.74 & \textbf{92.07} & 43.80\\
        SeCo \cite{seco} & 39.68 & 81.02 & 53.27\\
       
       
        ViT (ImageNet-22k) \cite{vit} & 70.77 & 66.34 & 68.49\\
        SatMAE \cite{satmae} & 70.45 & 60.29 & 64.98\\
        Swin (random) \cite{swin} & 57.97 & 62.06 & 59.94\\
        Swin (ImageNet-22k) \cite{swin} & 67.11 & 72.33 & 69.62\\
        \midrule
        GFM & \textbf{74.83} & 67.98 & \textbf{71.24}\\
        \midrule
    \end{tabular}
    }
\end{table}

\begin{figure}
    \centering
    \includegraphics[width=0.8\columnwidth]{figures/OSCD.png}
    \caption[]
    {Qualitative results on OSCD. White, green, and red indicate true positives, false positives, and false negatives, respectively.}
    \label{fig:OSCD}
\end{figure}

\subsection{Classification} \label{sec:classification}
Another common remote sensing application is classification. We evaluate on two datasets common in the literature \cite{seco, matter}: the UC Merced Land Use Dataset \cite{ucm} and BigEarthNet \cite{BEN}.
The UC Merced Land Use Dataset is a classic dataset in the remote sensing field. It contains 21 classes, each with 100 images of $256\times 256$ pixels and an approximate GSD of 1 foot. We split the data into training and validation sets according to \cite{data_splits}.

BigEarthNet \cite{BEN} (BEN) is a large-scale remote sensing dataset for multi-label classification. The data consist of 12-band Sentinel-2 images with sizes of $120\times 120$, $60\times 60$, and $20\times 20$ pixels for the bands at 10m, 20m, and 60m GSDs, respectively.
We employ the data split and 19-class evaluation common in the literature \cite{indomain, seco, satmae}.

In Table \ref{tab:BEN}, we report the classification accuracy on UC Merced (UCM) and mean average precision results on BigEarthNet (BEN) for all methods.
On UC Merced, we note the SeCo \cite{seco} pretrained model performs significantly worse than its ImageNet-1k pretrained counterpart with ResNet-50. 
These two datasets differ substantially in classes, satellite source, and GSD, and therefore having diverse feature knowledge is imperative to maintaining performance despite these distinctions.
Our model provides robust performance in both cases by leveraging ImageNet representations and remote sensing data in its learning. Furthermore, one key motivation for training a geospatial foundation model is to improve the sample efficiency for downstream tasks. Notably, we find that our model maintains strong performance on BigEarthNet, even when only given 1\% of the training data.









\begin{table}
    \caption{UC Merced classification accuracy and BigEarthNet multi-label classification mean average precision results on the validation set.}
    \label{tab:BEN}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \resizebox{\columnwidth}{!}{
    \begin{tabular}{cccc}
        \toprule
        Method & UCM  & BEN 10\% & BEN 1\%\\
        \toprule
        ResNet50 (ImageNet-1k) \cite{resnet} & 98.8 & 80.0 & 41.3\\
        SeCo \cite{seco} & 97.1 & 82.6 & 63.6\\
       
       
        ViT (ImageNet-22k) \cite{vit} & 93.1 & 84.7 & 73.6\\
        SatMAE \cite{satmae} & 92.6 & 81.8 & 68.9\\
        Swin (random) \cite{swin} & 66.9 & 80.6 & 65.7\\
        Swin (ImageNet-22k) \cite{swin} & \textbf{99.0} & 85.7 & 79.5\\
        \midrule
       GFM & \textbf{99.0} & \textbf{86.3} & \textbf{80.7}\\
        \midrule
    \end{tabular}
    }
\end{table}


\subsection{Segmentation} \label{sec:seg_detect}
Segmentation is a popular remote sensing application for enabling automated extraction of building footprints or land cover mappings over wide regions. We therefore conduct experiments on this task on two different datasets.

Vaihingen \cite{vaihingen} is an urban semantic segmentation dataset collected over Vaihingen, Germany at a GSD of 0.09m. We employ the data split implemented in the MMSegmentation library \cite{mmseg} for our experiments, with 344 images for training and 398 for validation, all with an image size of $512\times 512$ pixels. The WHU Aerial building dataset \cite{whu} is sampled over Christchurch, New Zealand at a GSD of 0.3m. Image tiles are provided at $512\times 512$ pixels, split into 4736 for training and 2416 for evaluation.

We report the intersection over union (IoU) segmentation results for all methods in Table \ref{tab:seg}. ImageNet pretrained models are notably strong performers in all cases. On both datasets, SeCo lags substantially behind its ImageNet counterpart. Interestingly, SatMAE improves over its ImageNet-22k baseline on WHU, but falls behind it by a larger margin on Vaihingen.
However, our approach is able to leverage the already strong ImageNet-22k representations and guide them towards the geospatial domain, resulting in overall improvement.
\begin{table}
    \caption{Results on the WHU Aerial and Vaihingen segmentation datasets. We finetune all methods for 40k iterations, and report the IoU for the building class on WHU and mean IoU (mIoU) across the 6 classes (impervious surface, building, low vegetation, tree, car, clutter) of Vaihingen.}
    \label{tab:seg}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \begin{tabular}{ccc}
        \toprule
        Method & WHU Aerial & Vaihingen\\
        \toprule
        ResNet50 (ImageNet-1k) \cite{resnet} & 88.5 & 74.0\\
        SeCo \cite{seco} & 86.7 & 68.9\\
       
       
        ViT (ImageNet-22k) \cite{vit} & 81.6 & 72.6\\
        SatMAE \cite{satmae} & 82.5 & 70.6 \\
        Swin (random) \cite{swin} & 88.2 & 67.0\\
        Swin (ImageNet-22k) \cite{swin} & 90.4 & 74.7 \\
        \midrule
        GFM & \textbf{90.7} & \textbf{75.3} \\
        \midrule
    \end{tabular}
\end{table}



\begin{table}
    \caption{SpaceNet2 Super-resolution Results}
    \label{tab:spacenet}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \begin{tabular}{ccc}
        \toprule
        Method & PSNR $\uparrow$ & SSIM $\uparrow$\\
        \toprule
       
       
       
       
       
        ViT (ImageNet-22k) \cite{vit} & \textbf{23.279} & 0.619 \\
        SatMAE \cite{satmae} & 22.742 & 0.621 \\
        Swin (random) \cite{swin} & 21.825 & 0.594 \\
        Swin (ImageNet-22k) \cite{swin} & 21.655 & 0.612 \\
       
        \midrule
        GFM & 22.599 & \textbf{0.638} \\
        \midrule
    \end{tabular}
\end{table}

\subsection{Super-resolution} \label{sec:superres}
In the previous experiments, we evaluated several common high-level tasks. Nonetheless, the low-level task of super-resolution is also important in the geospatial domain.
For this task, we repurpose the SpaceNet2 dataset, which contains 10,593 8-band images from four cities across the world: Las Vegas, Paris, Shanghai, and Khartoum. The data is provided at GSDs of both 1.24m (multi-spectral, $162\times 162$ pixels) and 0.3m (pan-sharpened multi-spectral, $650\times 650$ pixels). We formulate a super-resolution task, taking as input the 1.24m multi-spectral images and generating the 0.3m pan-sharpened equivalent. We evaluate the super-resolution performance of our model and several baselines with the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) in Table \ref{tab:spacenet}.
The ViT-L ImageNet-22k model and our model achieve the best PSNR and SSIM, respectively. Interestingly, SatMAE does not improve over the ViT-L ImageNet-22k baseline in PSNR. On the other hand, our method improves considerably over its ImageNet-22k baseline on both metrics.
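
For reference, PSNR is conventionally defined from the mean squared error (MSE) between the generated image and the ground truth as
\[
    \text{PSNR} = 10 \log_{10}\left(\frac{\text{MAX}^2}{\text{MSE}}\right),
\]
where $\text{MAX}$ is the maximum possible pixel value; higher values are better for both PSNR and SSIM. We state this standard definition for convenience only, and it does not imply a particular normalization of the pixel range. In terms of image size, the task above corresponds to an upscaling factor of approximately $650/162 \approx 4.0$ per side.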

\subsection{Ablation Studies} \label{sec:ablation}

We perform multiple ablation studies on the choice of distillation stage, loss balancing term $\alpha$, and the GeoPile dataset components.

\subsection{Distillation Stage}
When implementing our feature map distillation objective, a natural question is at which point the mapping should take place. We experiment with different locations by stage in the Swin transformer and report the corresponding ARP in Figure \ref{fig:ablation_plot}. Overall, performing the distillation after Stage 3 yields the highest ARP. Hence, we employ this scheme for all downstream experiments.
This result is also intuitively expected; distilling at Stage 3 gives a large portion of the model the supervisory signal from the teacher, while still allowing for purely domain-specific feature learning in the final layers.
\begin{figure}
    \centering
    \includegraphics[width=0.8\columnwidth]{figures/ablation_plot.png}
    \caption[]
    {a) Distillation stage ablation results. b) $\alpha$ balancing term ablation results.}
    \label{fig:ablation_plot}
\end{figure}

\subsection{Balancing Term $\alpha$}
As discussed in Section \ref{sec:gfm}, our multi-objective loss in Equation \ref{eq:loss_combined} includes a balancing parameter $\alpha$. We ablate this parameter in Figure \ref{fig:ablation_plot} and report the corresponding ARP. Overall, we find that the model with $\alpha=1.0$ performs the best.
\subsection{GeoPile Pretraining Dataset}
To ablate components of the GeoPile, we remove each dataset individually to assess its relative importance (Table \ref{tab:data_ablation}). We also compare using only the labeled data portion against using only the unlabeled NAIP imagery.

As expected, using only data from labeled datasets gives better performance with fewer images than using only NAIP imagery. The human-curated samples in these datasets are more likely to contain relevant objects and features, as they each correspond to a particular class of interest. Still, unlabeled data like NAIP can be sourced easily and at scale. Further scaling of both labeled and unlabeled portions could further improve performance; however, it would also increase the training time and sustainability impact. Therefore, we maintain GeoPile at approximately 600,000 images.

\begin{table}
    \caption{GeoPile pretraining dataset ablation. We remove each dataset individually from GeoPile and report the number of images remaining and resulting ARP. The row ``w/o curated datasets'' removes all data other than NAIP imagery.}
    \label{tab:data_ablation}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \begin{tabular}{ccc}
        \toprule
        Data & \# Images & ARP $\uparrow$ \\
        \toprule
       
       
       
       
       
       
        w/o RSD46-WHU & 444,061 & 2.87\\
        w/o MLRSNet & 451,793 & 3.30\\
        w/o RESISC45 & 529,454 & 2.72\\
        w/o PatternNet & 557,554 & 2.98\\
        w/o curated datasets & 300,000 & 1.62\\
        w/o NAIP & 260,954 & 2.65\\
       
        \midrule
    \end{tabular}
\end{table}

\subsection{Continual Pretraining Comparison}
In Table \ref{tab:init_ablation}, we compare our training paradigm with the vanilla continual pretraining approach of using the ImageNet-22k weights as initialization prior to beginning the pretraining step with GeoPile. We find this initialization to be helpful in improving performance over simply starting from scratch, which validates the effectiveness of continual pretraining, even in its simplest form. However, the performance is still limited despite significant computation. On the other hand, our multi-objective pretraining paradigm significantly improves the overall performance with minimal computational needs and carbon impact.
\begin{table}
    \caption{Continual pretraining comparison. In the first two rows, we experiment with simply initializing the model with ImageNet-22k weights prior to conducting MIM training on GeoPile. However, our proposed GFM is both more effective and more efficient.}
    \label{tab:init_ablation}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \begin{tabular}{cccc}
        \toprule
        Method & Epochs & ARP $\uparrow$ & CO2 $\downarrow$\\
        \toprule
       
       
        ImageNet-22k Init. & 200 & 2.66 & 12.64\\
        ImageNet-22k Init. & 800 & 2.98 & 50.56\\
        GFM & 100 & 4.47 & 8.56\\
        \midrule
    \end{tabular}
\end{table}

\section{Conclusion}
In summary, this paper investigates a sustainable approach for building geospatial foundation models. To this end, we first construct a concise yet diverse collection of data from various sources for effective pretraining. Second, we propose a multi-objective continual pretraining paradigm, in which we leverage the strong representations of ImageNet-22k to guide and quicken learning, while simultaneously providing the freedom to learn valuable in-domain features through self-supervised learning on geospatial data.
We hope our GFM is one step forward in inspiring the path towards high-performing yet sustainable geospatial foundation models.
{\small
\bibliographystyle{ieee_fullname}


\section{Introduction} \label{sec:intro}





\begin{figure}
    \centering
    \includegraphics[width=0.9\columnwidth]{figures/spider_all.png}
    \caption[]
    {Our sustainable geospatial foundation model (GFM) achieves strong performance on a broad set of tasks in comparison to other state-of-the-art geospatial pretraining methods (SeCo \cite{seco}, SatMAE \cite{satmae}) and ImageNet supervised pretraining baselines.} 
    \label{fig:model_comparison}
\end{figure}

With the rise of large-scale satellite and aerial imagery~\cite{landsat,naip}, geospatial technologies are becoming increasingly important. Progress in this domain can substantially improve our ability to understand the earth and how we interact with it.
In the current era of deep learning, data fuels progress. Thankfully, the amount of data and tasks in the geospatial domain continues to grow. This has recently been made evident in a survey \cite{earthnets} compiling hundreds of published datasets for earth observation. Beyond curated datasets, openly available satellite imagery programs like Sentinel~\cite{sentinel}, Landsat~\cite{landsat}, and the National Agriculture Imagery Program (NAIP)~\cite{naip} also provide a plethora of data for use.

With such vast interest in the application of geospatial and remote sensing data,
the computer vision community has been continually investing in designing better algorithms to harness the available data and improve performance on various tasks.
In particular, with the rising popularity of foundation models in vision and natural language, many works have focused on building strong pretrained models specific to the geospatial domain \cite{seco, gassl, satmae, millionAID_supervised_pretraining}.
These methods typically train a network from scratch on a large corpus of remote sensing imagery. Unfortunately, this can require a significant amount of data and training time to achieve good performance, especially when employing large state-of-the-art transformer models.

However, in this pursuit of a stronger geospatial foundation model, one potentially useful tool has been largely forgotten. ImageNet pretrained models are readily available for the majority of state-of-the-art architectures. More recently, many models trained on the larger-scale ImageNet-22k~\cite{imagenet} dataset have become available, providing even stronger and more general representations than before.
Rather than beginning the pretraining process tabula rasa, could these ImageNet representations serve as a base on which stronger geospatial models can be built?
In the same spirit, continual pretraining has been practiced in the NLP domain with success in various works \cite{dontStopPretraining, continual_temporal, continual_mixedLang}. In this paradigm, existing foundation models are further improved for a specific domain or task through a secondary pretraining stage. In principle, we reason that such a paradigm has the potential to produce strong geospatial models in an efficient and sustainable manner.

To this end, we investigate a sustainable approach for building geospatial foundation models. Specifically, we form a multi-objective continual pretraining paradigm, simultaneously leveraging ImageNet pretrained features and self-supervised learning on a concise collection of geospatial imagery. In our investigations, we discover two important factors in the process.

\begin{itemize}
    \item \textbf{Data choice matters.} We find that the selection of pretraining data matters, even within the geospatial domain. Therefore, we select a diverse collection of data from various sources to capture a wider variety of general remote sensing scenes, which we term GeoPile. Conducting masked image modeling with GeoPile is significantly more effective and sample efficient compared to other common alternatives (see Section \ref{sec:data}).
    \item \textbf{Continual pretraining.} Available pretrained models on diverse datasets like ImageNet-22k should not be ignored when building geospatial foundation models. Rather, by leveraging their representations, we can build strong models for geospatial applications in a sustainable manner. To this end, we investigate a multi-objective continual pretraining paradigm for simple and effective learning (see Section \ref{sec:gfm}). By continual pretraining, we show that our newly proposed Geospatial Foundation Model (GFM) outperforms previous state-of-the-art geospatial pretrained models on a broad set of tasks, as shown in Figure~\ref{fig:model_comparison}.
\end{itemize}


\section{Related Work} \label{sec:related}
\subsection{Masked Image Modeling}
Masked image modeling (MIM) has been proposed in various forms in recent years, and has quickly become very popular as an effective pretraining task.
In general, the goal is to learn from data in a self-supervised manner by asking the model to generate pixel values for intentionally-withheld regions in an image.
\cite{context_encoders} is an early work with an aim of learning strong visual representations through inpainting masked regions. In \cite{generative_pretrain}, Chen et. al train a large transformer to predict pixels autoregressively. After the introduction of vision transformers (ViT) \cite{vit}, many works continued to improve various MIM variants. \cite{beit} and \cite{ibot} take inspiration from BERT \cite{bert} in natural language processing, and tokenize the image patches with either a pretrained model or jointly trained online tokenizer, with the objective being to reconstruct at a token-level rather than raw pixels. Recently, \cite{simmim} and \cite{mae} show that a masked image modeling task of simply regressing directly on the image pixels is sufficient and effective. In this work, we leverage the framework from \cite{simmim}, as it is compatible with hierarchical transformer architectures \cite{swin}.  

\subsection{Geospatial Pretraining}
Various works have experimented with employing supervised or self-supervised pretraining paradigms in the geospatial domain. The classical work of \cite{indomain} and the more recent paper \cite{millionAID_supervised_pretraining} investigate supervised pretraining on individual datasets of various sizes. Interestingly, these works still often found ImageNet pretrained models to perform very well, particularly with vision transformers \cite{vit, swin}.
Other works have explored self-supervised learning paradigms for remote sensing, primarily focused on contrastive methods. \cite{seco} and \cite{gassl} employ a MoCo \cite{mocov2} style objective using spatially aligned but temporally different images as the positive pairs. \cite{saumoco} and \cite{tile2vec} also utilize a MoCo-inspired objective, but specify a cropping procedure to generate positives and negatives within and across images. \cite{colorOutofSpace} employs a colorization objective on Sentinel-2 imagery utilizing the various spectral bands. Most recently, SatMAE \cite{satmae} explores the use of masked image modeling to train a large ViT model. This work is similar in some respects to ours, as we also train a transformer model with an MIM objective. However, we find that SatMAE often does not perform better than the off-the-shelf ImageNet-22k pretrained ViT (Section \ref{sec:experiments}). This both indicates the difficulty of building strong geospatial foundation models from scratch and highlights the potential usefulness of leveraging continual pretraining instead, as we investigate in this work.




In this work, we develop our pretraining objective based on a masked image modeling approach like \cite{simmim, mae}.
MIM has recently been shown to be particularly effective in the natural image domain, surpassing many contrastive works and proving friendlier to downstream optimization \cite{simmim, mae, ibot, beit, dark_secrets}. Exploration of the masked image modeling framework in geospatial applications is still in its early stages, and could help alleviate some concerns with contrastive approaches in this domain.
In particular, the choice of augmentations for contrastive methods can be quite difficult, as common selections such as greyscale, color jitter, and others that heavily affect the intensity of the image can instill undesirable invariances \cite{indomain}. On the other hand, MIM objectives like \cite{simmim, mae} rely only on simple spatial augmentations such as flipping and cropping. Furthermore, a common remote sensing application is that of change detection, which requires a model to detect changes between two images of the same location taken at different times. In order to remain effective on this task, works that use contrastive approaches on temporal positives introduce various design choices. For instance, SeCo \cite{seco} creates multiple feature subspaces during pretraining, each one invariant to a separate form of augmentation. \cite{matter} also employs temporal positives, but instead chooses the sampling locations for the pretraining data to ensure that image pairs contain primarily natural illumination and viewing angle variation, without major changes such as new urban developments.
\begin{figure*}
    \centering
    \includegraphics[trim={160 160 160 160},clip, width=0.9\textwidth]{figures/samples.pdf}
    \caption[]
    {We visualize some example images from the pretraining datasets. From left to right: ImageNet, Sentinel-2, and GeoPile. Sentinel-2 has noticeably lower feature diversity within a single image and across images than ImageNet or our GeoPile pretraining dataset.}
    \label{fig:data_comparison_visual}
\end{figure*}

\subsection{Continual Pretraining}
Continual pretraining has been primarily introduced in the natural language domain \cite{dontStopPretraining, continual_temporal, continual_mixedLang}, in order to improve large language models (LLMs). \cite{dontStopPretraining} illustrates the viability of two additional stages of pretraining, first using in-domain data (domain-adaptive) and then further using task-specific data (task-adaptive). \cite{continual_temporal} proposes a continual training paradigm that gives pretrained language models temporal reasoning abilities. \cite{continual_mixedLang} focuses on using continual pretraining to enable mixed-language neural machine translation. In the vision domain, \cite{medseg} employs a BYOL \cite{byol} style continual pretraining paradigm for 2D medical image segmentation. \cite{selfimproveself} explores a hierarchical pretraining approach for task adaptation. However, they primarily focus on adapting to one specific downstream task at a time, employing three training stages on top of an existing pretrained model for each task individually. In contrast, we employ one efficient in-domain pretraining setting that can generalize to many downstream tasks, as illustrated in Section \ref{sec:experiments}. Furthermore, rather than directly loading the pretrained weights from existing models as initialization, we find that leveraging their representations through an auxiliary distillation objective during pretraining enables learning strong in-domain representations in a sustainable manner.

\section{Methodology}
We aim to investigate a sustainable approach for building geospatial foundation models. This leads us to two key insights. First, the selection of pretraining data matters, even within the geospatial domain. We discuss our empirical findings to this end in Section \ref{sec:data}. Second, available pretrained models on diverse datasets like ImageNet-22k should not be ignored when building geospatial foundation models. In fact, by leveraging their representations, we can build strong models for geospatial applications in a sustainable manner. These discussions can be found in Section \ref{sec:gfm}.
\subsection{Pre-training Data Matters} \label{sec:data}

A particularly common choice of source data among geospatial contrastive pretraining works is Sentinel-2 imagery \cite{seco, matter, colorOutofSpace} due to its large corpus of available data and ease of access.
Therefore, to begin our study, we first gather a pretraining dataset of $\sim$1.3 million Sentinel-2 images (matching the scale of ImageNet-1k~\cite{imagenet}) using the sampling technique from \cite{seco}.
After gathering the Sentinel-2 data, we employ it to pretrain a Swin-B \cite{swin} model with the masked image modeling (MIM) objective from \cite{simmim}. 
We then finetune and evaluate this model on a wide variety of downstream datasets to get a broad understanding of its performance potential in many tasks (see Section \ref{sec:experiments} for task details). For a comparison, we finetune the ImageNet-22k pretrained Swin-B from the official Swin Transformer repository \cite{swin} on all downstream tasks as a baseline. In order to compare these models across all tasks, we introduce an average relative performance metric (ARP) in which we take the relative percentage difference on each task with respect to the ImageNet-22k baseline, and then average that difference:
\begin{equation} \label{eg:arp}
    \text{ARP}(M) = \frac{1}{N}\sum_{i=1}^N \frac{\text{score}(M, \text{task}_i) -\text{score}( \text{baseline}, \text{task}_i)}{\text{score}( \text{baseline}, \text{task}_i)}.
\end{equation}
\noindent Here ``baseline'' is the Swin-B model pretrained on ImageNet-22k and finetuned on ImageNet-1k, as mentioned above, $M$ denotes the model being evaluated, and $N$ is the number of tasks. There are $7$ tasks used in Section~\ref{sec:experiments}, covering important geospatial applications such as classification, multi-label classification, semantic segmentation, change detection, and super-resolution. The reported ARP value is scaled by 100 so that it reads as a percentage.
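As a purely illustrative example of Equation~\ref{eg:arp} (the scores here are hypothetical and are not results from our experiments), consider $N=2$ tasks on which the baseline scores $80.0$ and $50.0$, while a model $M$ scores $84.0$ and $49.0$:
\[
    \text{ARP}(M) = \frac{1}{2}\left(\frac{84.0-80.0}{80.0} + \frac{49.0-50.0}{50.0}\right) = \frac{0.05 - 0.02}{2} = 0.015,
\]
which would be reported as $1.5$ after scaling by 100. A gain on one task can therefore be partially offset by a loss on another, and each task contributes relative to its own baseline score, regardless of the scale of the underlying metric.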

We compare these two models in Table \ref{tab:data}.
Interestingly, we find that the Sentinel-2 model performs poorly on downstream tasks compared to the ImageNet-22k baseline.
To investigate this further, we also pretrain a model using MIM on ImageNet-1k,
and find this actually performs better than using Sentinel-2 imagery.
While there is obviously a degree of domain shift between ImageNet and remote sensing data, we reason that Sentinel-2 data alone lacks sufficient feature diversity for a strong pretraining dataset. As a basic indicator, we calculate the average image entropy over a randomly sampled set of 3000 images for both ImageNet-1k and our collected Sentinel-2 data and find it to be 5.1 and 3.9, respectively. Although entropy is certainly not the sole factor at play, such an evaluation can still provide insight into the advantages of ImageNet over Sentinel-2. For MIM objectives, training data with substantially lower entropy can make for an easier reconstruction task, since masked regions may be more similar to their neighbors. Therefore, the network does not have to work as hard to fill in the blanks, limiting the learning potential. Qualitatively, we also visualize multiple samples from ImageNet-1k and Sentinel-2 in Figure \ref{fig:data_comparison_visual}. The feature diversity within a single image and across images of Sentinel-2 is perceptibly lower than that of ImageNet. This result indicates that a comparatively narrow scope of features is provided to the model when pretraining with Sentinel-2.
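For concreteness, the entropy values above can be read under the standard Shannon definition applied to an image's intensity histogram. We state this as an assumed form, since details of the estimator such as the number of histogram bins or the handling of color channels are not specified here. For an image whose histogram assigns probability $p_k$ to bin $k$ out of $K$ bins,
\[
    H = -\sum_{k=1}^{K} p_k \log_2 p_k, \qquad 0 \le H \le \log_2 K,
\]
so that, for example, an 8-bit image with $K=256$ bins has at most 8 bits of entropy. $H$ is zero for a constant image and maximal for a uniform histogram, which is why a lower average entropy is consistent with less varied image content.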



Therefore, we set out to collect a diverse geospatial pretraining dataset. Sourcing from both labeled and unlabeled data, we form a new pretraining dataset which we term GeoPile. The breakdown of GeoPile is shown in Table \ref{tab:geopile}. For textural detail, we ensure a variety of ground sample distances (GSD), including images with much higher resolution than Sentinel-2 (which has a GSD of 10m). Furthermore, the selected labeled datasets encompass a wide variety of classes from general remote sensing scenes, ensuring visual diversity across samples. We calculate the average entropy of our GeoPile dataset and find it to be 4.6, much closer to that of ImageNet-1k. The textural and visual diversity is also qualitatively evident in Figure \ref{fig:data_comparison_visual}. As shown in Table \ref{tab:data}, this data selection yields a substantial increase in downstream performance.

To further improve the performance of our pretrained model in comparison to the ImageNet-22k baseline, we increase the number of training epochs in the last rows of Table \ref{tab:data}. While we are able to make improvements, this comes at the cost of substantially more compute and carbon footprint for marginal gain. Therefore, we ask the question: can we significantly improve performance with minimal compute and carbon footprint overhead? To this end, we investigate a simple and sustainable approach for building geospatial foundation models with strong performance.









\begin{table}
    \caption{Dataset analysis. To evaluate each method, we finetune the pretrained model on seven different tasks, outlined in Section \ref{sec:experiments}, and report the ARP metric defined in Equation \ref{eg:arp}. Overall, our collected GeoPile pretraining dataset significantly improves downstream performance. To further improve the performance in a sustainable manner, we introduce our continual pretraining paradigm GFM. We show the ARP and CO2 estimations \cite{co2} for GFM trained on GeoPile.}
    \label{tab:data}
    \centering
    \setlength\tabcolsep{4.0pt} 
   
   
    \begin{tabular}{ccccc}
        \toprule
        Method & \# Images & Epochs & ARP $\uparrow$ & CO2 $\downarrow$\\
        \midrule
        ImageNet-22k Sup. & 14M & - & 0.0 & -\\
        \midrule
        ImageNet-1k & 1.3M & 100 & 1.82 & 17.76\\
        Sentinel-2 \cite{seco} & 1.3M & 100 & $-5.53$ & 17.76\\
        GeoPile & 600k & 200 & 2.02 & 12.64\\
        GeoPile & 600k & 800 & 2.44 & 50.56\\
        \midrule
        GFM & 600k & 100 & 4.47 & 8.56\\
        \bottomrule
    \end{tabular}
\end{table}

\begin{table}
    \caption{Breakdown of datasets in the GeoPile. We gather approximately 600k samples from a combination of labeled and unlabeled satellite imagery.}
    \label{tab:geopile}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \begin{tabular}{cccc}
        \toprule
        Dataset & \# Images & GSD & \# Classes\\
        \midrule
        NAIP \cite{naip} & 300,000 & 1m & n/a\\
        RSD46-WHU \cite{RSD46-WHU} & 116,893 & 0.5m - 2m & 46\\
        MLRSNet \cite{MLRSNet} & 109,161 & 0.1m - 10m & 60\\
        RESISC45 \cite{RESISC45} & 31,500 & 0.2m - 30m & 45\\
        PatternNet \cite{PatternNet} & 30,400 & 0.1m - 0.8m & 38\\
        \bottomrule
    \end{tabular}
\end{table}
\subsection{Sustainable GFM} \label{sec:gfm}

\begin{figure*}
    \centering
    \includegraphics[trim={100 60 100 60},clip, width=0.75\textwidth]{figures/gfm.pdf}
    \caption[]
    {Our GFM multi-objective continual pretraining paradigm. The framework consists of two parallel branches: one initialized with ImageNet-22k weights (top) and one randomly initialized (bottom). Blue functional blocks are frozen during training, and green ones are trained. In a teacher-student fashion, we leverage the intermediate features of an ImageNet-22k pretrained model to guide and quicken learning. Furthermore, we build in an MIM objective on the student branch to allow for learning valuable in-domain features directly from the geospatial data.}
    \label{fig:gfm}
\end{figure*}

Nowadays, many state-of-the-art models are released with weights pretrained on very large and diverse datasets like ImageNet-22k. While not perfect for geospatial tasks, these models still contain a vast amount of useful knowledge that is generalizable across many settings.
Nonetheless, the majority of previous works in geospatial pretraining neglect the available ImageNet representations, which is not ideal, especially for large transformer models that are notoriously data hungry and computationally expensive to train.

Instead, we reason that the valuable knowledge available in these large-scale models should be leveraged to produce strong performance with minimized overhead. To this end, we propose an unsupervised multi-objective training paradigm for sustainable training of geospatial foundation models, illustrated in Figure \ref{fig:gfm}.

There are two main components in our framework. First, we randomly initialize an encoder and decoder set up for MIM as in \cite{simmim}. During training, the input is randomly masked, and the network attempts to reconstruct the image at the output. This MIM objective is enforced with an $\ell_1$ loss \cite{simmim}:
\begin{equation}
    \mathcal{L}_{MIM}=\frac{\left\|\mathbf{O}_\kappa-\mathbf{G}_\kappa\right\|_1}{N},
\end{equation}
where $\mathbf{O}_{\kappa}$ is the original pixel information from the masked regions, $\mathbf{G}_{\kappa}$ are the generated reconstructions for those masked regions, and $N$ is the total number of masked pixels.

To leverage the strong representations from the ImageNet model, we initialize a second encoder branch up to a chosen stage $L$ and load the pretrained weights.
This branch serves as a teacher during the training process for the other ``student'' branch, which becomes our final GFM model. For the ImageNet teacher, we freeze the weights, both to ensure that the structured representations are maintained during the training process and to reduce the computation required during optimization.
 
Rather than using the masked input as in the student branch, the teacher receives the unmasked image as input,
and provides a feature output $f_{L}^{T}$ at stage $L$. This feature has access to the full context of the input, enabling it to capture informative representations.
We utilize this feature to guide the representations of the student, and form a secondary objective with the cosine similarity between branch features: 
\begin{equation}
    \mathcal{L}_{feat} =  -\frac{P(f_{L}^{S})}{\left\|P(f_{L}^{S})\right\|_2} \cdot \frac{f_{L}^{T}}{\left\|f_{L}^{T}\right\|_2},
\end{equation}
where $f_{L}^{S}$ and $f_{L}^{T}$ are the intermediate features of the student and teacher branches at stage $L$, and $P$ is a linear projection layer. Therefore, the final loss during training is simply the weighted sum of these objectives:
\begin{equation} \label{eq:loss_combined}
    \mathcal{L} = \mathcal{L}_{MIM} + \alpha\mathcal{L}_{feat},
\end{equation}
where $\alpha$ is a balancing term, which we find works best when simply set to $\alpha=1.0$ (Section \ref{sec:ablation}).
This training paradigm enables an ideal two-fold optimization. Distillation from the intermediate features of the teacher ensures that the student can benefit from the teacher's diverse knowledge, learning more in less time. Furthermore, the student is simultaneously given freedom to adapt to in-domain data through its own pretraining objective, gathering new domain-specific features to improve performance.
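To make the scale of the two terms concrete, note that the feature objective is the negative cosine similarity between the projected student feature and the teacher feature. Writing $\theta$ for the angle between $P(f_{L}^{S})$ and $f_{L}^{T}$ (a symbol introduced here only for illustration, and assuming the similarity is computed per feature vector and then averaged),
\[
    \mathcal{L}_{feat} = -\cos\theta \in [-1, 1],
\]
with the minimum of $-1$ attained when the two features are perfectly aligned. Since $\mathcal{L}_{MIM} \geq 0$, the combined loss in Equation~\ref{eq:loss_combined} is bounded below by $-\alpha$ for $\alpha \geq 0$ and can take negative values during training.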

We analyze the ARP and sustainability potential of this approach in Table \ref{tab:data}. Notably, our GFM achieves better overall performance with substantially less computation and emissions impact\footnote{CO2 estimations were completed with \url{https://mlco2.github.io/impact#compute} from \cite{co2}.} than tabula rasa pretraining on the same dataset, illustrating that our multi-objective continual pretraining paradigm is a sustainable method for training these models.
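Using the values in Table~\ref{tab:data}, the savings can be made explicit:
\[
    \frac{50.56}{8.56} \approx 5.9, \qquad \frac{12.64}{8.56} \approx 1.5,
\]
i.e., the estimated emissions of GFM are roughly $5.9\times$ lower than those of 800-epoch MIM pretraining on GeoPile and roughly $1.5\times$ lower than those of the 200-epoch run, while its ARP is higher than both ($4.47$ versus $2.44$ and $2.02$). The per-epoch estimate is in fact higher for GFM ($8.56/100 \approx 0.086$ versus $12.64/200 \approx 0.063$), which is plausibly due to the additional teacher branch; the overall savings come from requiring far fewer epochs.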


 



\section{Experiments} \label{sec:experiments}
To verify the effectiveness of our model in detail, we conduct experiments on seven geospatial datasets spanning various tasks, including change detection (Section \ref{sec:change_det}), classification (Section \ref{sec:classification}), segmentation (Section \ref{sec:seg_detect}), and super-resolution (Section \ref{sec:superres}).

\subsection{Change Detection} \label{sec:change_det}
Change detection is a particularly important remote sensing task, helping us understand how humans interact with our planet over time, as well as the natural phenomena that change our planet's landscape. We conduct experiments on both the Onera Satellite Change Detection (OSCD) dataset \cite{OSCD} and DSIFN \cite{DSIFN}, and report results in Table \ref{tab:OSCD} and Table \ref{tab:DSFIN}, respectively.


OSCD consists of 24 image pairs extracted from various regions around the world within a three-year period from 2015 to 2018. The images are taken from Sentinel-2 with GSDs ranging from 10m to 60m, and are split into 14 pairs for training and 10 for evaluation. The annotations indicate whether change has occurred at the pixel level, and focus primarily on urban developments. Similarly, we also test our method on the DSIFN dataset, which contains high-resolution imagery from sources such as WorldView-3 and GeoEye-1 \cite{DSIFN}, with 3490 samples for training and 48 images for evaluation. Each pair of images from a given location at two different timestamps is fed into the Swin encoder \cite{swin} for feature extraction. The difference between the features of each pair is computed and fed into a UPerNet \cite{Upernet} to generate the final binary segmentation masks \cite{seco, siamdiff}. The encoder is initialized with the pretrained weights.


For both datasets, we report the precision, recall, and F1 score on the ``change'' class. As shown by the results on OSCD (Table \ref{tab:OSCD} and Figure \ref{fig:OSCD}) and DSIFN (Table \ref{tab:DSFIN}), GFM consistently improves over the ImageNet-22k baseline on both datasets. Notably, SatMAE improves over its ImageNet-22k baseline on OSCD, but lags behind on DSIFN. This further highlights the difficulty of training large vision transformers from scratch that perform consistently across different GSDs.
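As a consistency check on how the three reported metrics relate, recall the standard definition of the F1 score as the harmonic mean of precision and recall. Applying it to the GFM row of Table~\ref{tab:OSCD} gives
\[
    \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times 58.07 \times 61.67}{58.07 + 61.67} \approx 59.82,
\]
which matches the tabulated value. Because the harmonic mean is dominated by the smaller of the two quantities, a method with high precision but low recall (or vice versa) receives a low F1 score.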

\begin{table}
    \caption{Onera Satellite Change Detection Results}
    \label{tab:OSCD}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \resizebox{\columnwidth}{!}{
    \begin{tabular}{cccc}
        \toprule
        Method & Precision $\uparrow$ & Recall $\uparrow$ & F1 $\uparrow$\\
        \midrule
        ResNet50 (ImageNet-1k) \cite{resnet} & \textbf{70.42} & 25.12 & 36.20\\
        SeCo \cite{seco} & 65.47 & 38.06 & 46.94\\
        MATTER \cite{matter} & 61.80 & 57.13 & 59.37\\
        ViT (ImageNet-22k) \cite{vit} & 48.34 & 22.52 & 30.73\\
        SatMAE \cite{satmae} & 48.19 & 42.24 & 45.02\\
        Swin (random) \cite{swin} & 51.80 & 47.69 & 49.66\\
        Swin (ImageNet-22k) \cite{swin} & 46.88 & 59.28 & 52.35\\
        \midrule
        GFM & 58.07 & \textbf{61.67} & \textbf{59.82}\\
        \bottomrule
    \end{tabular}
    }
\end{table}

\begin{table}
    \caption{DSIFN Change Detection Results}
    \label{tab:DSFIN}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \resizebox{\columnwidth}{!}{
    \begin{tabular}{cccc}
        \toprule
        Method & Precision $\uparrow$ & Recall $\uparrow$ & F1 $\uparrow$\\
        \midrule
        ResNet50 (ImageNet-1k) \cite{resnet} & 28.74 & \textbf{92.07} & 43.80\\
        SeCo \cite{seco} & 39.68 & 81.02 & 53.27\\
        ViT (ImageNet-22k) \cite{vit} & 70.77 & 66.34 & 68.49\\
        SatMAE \cite{satmae} & 70.45 & 60.29 & 64.98\\
        Swin (random) \cite{swin} & 57.97 & 62.06 & 59.94\\
        Swin (ImageNet-22k) \cite{swin} & 67.11 & 72.33 & 69.62\\
        \midrule
        GFM & \textbf{74.83} & 67.98 & \textbf{71.24}\\
        \bottomrule
    \end{tabular}
    }
\end{table}

\begin{figure}
    \centering
    \includegraphics[width=0.8\columnwidth]{figures/OSCD.png}
    \caption[]
    {Qualitative results on OSCD. White, green, and red represent true positives, false positives, and false negatives, respectively.}
    \label{fig:OSCD}
\end{figure}

\subsection{Classification} \label{sec:classification}
Another common remote sensing application is that of classification. We evaluate two datasets common in the literature \cite{seco, matter}: UC Merced Land Use Dataset \cite{ucm} and BigEarthNet \cite{BEN}.
The UC Merced Land Use Dataset is a classic dataset in the remote sensing field. It contains 21 classes, each with 100 images at 256x256 pixels and an approximate GSD of 1 foot. We split the data into train and validation according to \cite{data_splits}.

BigEarthNet \cite{BEN} (BEN) is a large-scale remote sensing dataset for multi-label classification. The data consist of 12-band Sentinel-2 images with sizes of 120x120, 60x60, and 20x20 pixels for the bands at 10m, 20m, and 60m GSDs, respectively.
We employ the data split and 19 class evaluation as common in the literature \cite{indomain, seco, satmae}.

In Table \ref{tab:BEN}, we report the classification accuracy on UC Merced (UCM) and mean average precision results on BigEarthNet (BEN) for all methods.
On UC Merced, we note that the SeCo \cite{seco} pretrained model performs significantly worse than its ImageNet-1k pretrained counterpart with ResNet-50.
These two datasets differ greatly in classes, satellite source, and GSD, and therefore having diverse feature knowledge is imperative to maintaining performance despite these distinctions.
Our model provides robust performance in both cases by leveraging both ImageNet representations and remote sensing data in its learning. Furthermore, one key motivation for training a geospatial foundation model is to improve the sample efficiency for downstream tasks. Notably, we find that our model maintains strong performance on BigEarthNet, even when given only 1\% of the training data.









\begin{table}
    \caption{UC Merced classification accuracy and BigEarthNet multi-label classification mean average precision results on the validation set.}
    \label{tab:BEN}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \resizebox{\columnwidth}{!}{
    \begin{tabular}{cccc}
        \toprule
        Method & UCM & BEN 10\% & BEN 1\%\\
        \midrule
        ResNet50 (ImageNet-1k) \cite{resnet} & 98.8 & 80.0 & 41.3\\
        SeCo \cite{seco} & 97.1 & 82.6 & 63.6\\
        ViT (ImageNet-22k) \cite{vit} & 93.1 & 84.7 & 73.6\\
        SatMAE \cite{satmae} & 92.6 & 81.8 & 68.9\\
        Swin (random) \cite{swin} & 66.9 & 80.6 & 65.7\\
        Swin (ImageNet-22k) \cite{swin} & \textbf{99.0} & 85.7 & 79.5\\
        \midrule
        GFM & \textbf{99.0} & \textbf{86.3} & \textbf{80.7}\\
        \bottomrule
    \end{tabular}
    }
\end{table}


\subsection{Segmentation} \label{sec:seg_detect}
Segmentation is a popular remote sensing application for enabling automated extraction of building footprints or land cover mappings over wide regions. We therefore conduct experiments on this task on two different datasets.

Vaihingen \cite{vaihingen} is an urban semantic segmentation dataset collected over Vaihingen, Germany at a GSD of 0.9m. We employ the data split implemented in the MMSegmentation library \cite{mmseg} for our experiments, with 344 images for training and 398 for validation, all with an image size of $512\times 512$ pixels. The WHU Aerial building dataset \cite{whu} is sampled over Christchurch, New Zealand at a GSD of 0.3m. Image tiles are provided at $512\times 512$ pixels, split into 4736 for training and 2416 for evaluation.

We report the intersection over union (IoU) segmentation results for all methods in Table \ref{tab:seg}. ImageNet pretrained models are notably strong performers in all cases. On both datasets, SeCo lags substantially behind its ImageNet counterpart. Interestingly, SatMAE improves over its ImageNet-22k ViT baseline on WHU, but falls behind it by a larger margin on Vaihingen.
In contrast, our approach is able to leverage the already strong ImageNet-22k representations and guide them towards the geospatial domain, resulting in an overall improvement.
\begin{table}
    \caption{Results on the WHU Aerial and Vaihingen segmentation datasets. We finetune all methods for 40k iterations, and report the IoU for the building class on WHU and mean IoU (mIoU) across the 6 classes (impervious surface, building, low vegetation, tree, car, clutter) of Vaihingen.}
    \label{tab:seg}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \begin{tabular}{ccc}
        \toprule
        Method & WHU Aerial & Vaihingen\\
        \midrule
        ResNet50 (ImageNet-1k) \cite{resnet} & 88.5 & 74.0\\
        SeCo \cite{seco} & 86.7 & 68.9\\
        ViT (ImageNet-22k) \cite{vit} & 81.6 & 72.6\\
        SatMAE \cite{satmae} & 82.5 & 70.6\\
        Swin (random) \cite{swin} & 88.2 & 67.0\\
        Swin (ImageNet-22k) \cite{swin} & 90.4 & 74.7\\
        \midrule
        GFM & \textbf{90.7} & \textbf{75.3}\\
        \bottomrule
    \end{tabular}
\end{table}

\subsection{Super-resolution} \label{sec:superres}
In the previous experiments, we evaluated several common high-level tasks. Nonetheless, the low-level task of super-resolution is also important in the geospatial domain.
For this task, we repurpose the SpaceNet2 dataset, which contains 10,593 8-band images from four cities across the world: Las Vegas, Paris, Shanghai, and Khartoum. The data is provided at both a GSD of 1.24m (multi-spectral, 162x162 pixels) and 0.3m (pan-sharpened multispectral, 650x650 pixels). We formulate a super-resolution task, taking as input the 1.24m multi-spectral images and generating the 0.3m pan-sharpened equivalent. We evaluate the super-resolution performance of our model and several baselines with the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) in Table \ref{tab:spacenet}.
The ViT-L ImageNet-22k model and our model achieve the best PSNR and SSIM, respectively. Interestingly, SatMAE does not improve over this ViT baseline in PSNR, and its gain in SSIM is marginal. On the other hand, our method improves considerably over its ImageNet-22k baseline on both metrics.



\begin{table}
    \caption{SpaceNet2 Super-resolution Results}
    \label{tab:spacenet}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \begin{tabular}{ccc}
        \toprule
        Method & PSNR $\uparrow$ & SSIM $\uparrow$\\
        \midrule
        ViT (ImageNet-22k) \cite{vit} & \textbf{23.279} & 0.619\\
        SatMAE \cite{satmae} & 22.742 & 0.621\\
        Swin (random) \cite{swin} & 21.825 & 0.594\\
        Swin (ImageNet-22k) \cite{swin} & 21.655 & 0.612\\
        \midrule
        GFM & 22.599 & \textbf{0.638}\\
        \bottomrule
    \end{tabular}
\end{table}

\subsection{Ablation Studies} \label{sec:ablation}

We perform multiple ablation studies on the choice of distillation stage, loss balancing term $\alpha$, and the GeoPile dataset components.

\subsection{Distillation Stage}
When implementing our feature map distillation objective, a natural question is at which point the mapping should take place. We experiment with different locations by stage in the Swin transformer and calculate the corresponding ARP in Figure \ref{fig:ablation_plot}. Overall, performing the distillation after Stage 3 yields the highest ARP. Hence, we employ this scheme for all downstream experiments.
This result is also intuitive: distilling at Stage 3 gives a large portion of the model the supervisory signal from the teacher, while still allowing for purely domain-specific feature learning in the final layers.
\iffalse
\begin{table}
    \caption{Feature Distillation Ablation \textcolor{red}{Make this a bar plot}}
    \label{tab:dist_abl}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \begin{tabular}{cc}
        \toprule
        Position & ARP $\uparrow$ \\
        \toprule
       
        \midrule
       
       
       
       
        Stage 1 & 0.75\\
        Stage 2 & 1.04\\
        Stage 3 & 2.39\\
        Stage 4 & 1.75\\
        \midrule
    \end{tabular}
\end{table}
\fi 
\begin{figure}
    \centering
    \includegraphics[width=0.8\columnwidth]{figures/ablation_plot.png}
    \caption[]
    {a) Distillation stage ablation results. b) $\alpha$ balancing term ablation results.}
    \label{fig:ablation_plot}
\end{figure}

\subsection{Balancing Term $\alpha$}
As discussed in Section \ref{sec:gfm}, our multi-objective loss in Equation \ref{eq:loss_combined} includes a balancing parameter $\alpha$. We ablate this parameter in Figure \ref{fig:ablation_plot} and report the corresponding ARP. Overall, we find that the model with $\alpha=1.0$ performs the best.
\iffalse
\begin{table}
    \caption{Balancing Term $\alpha$ Ablation. \textcolor{red}{Make this a bar plot}}
    \label{tab:alpha_abl}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \begin{tabular}{cc}
        \toprule
        $\alpha$ & ARP \\
        \toprule
       
       
       
       
       
       
        0.1 & 1.10\\
        0.5 & 1.64\\
        1.0 & 2.39\\
       
        \midrule
    \end{tabular}
\end{table}
\fi 
\subsection{GeoPile Pretraining Dataset}
To ablate the components of GeoPile, we remove each dataset individually to assess its relative importance (Table \ref{tab:data_ablation}). We also compare using only the labeled data portion against using only the unlabeled NAIP imagery portion.

As expected, using only data from labeled datasets gives better performance with fewer images than using only NAIP imagery. The human-curated samples in these datasets are more likely to contain relevant objects and features, as they each correspond to a particular class of interest. Still, unlabeled data like NAIP can be sourced easily and at scale. Further scaling of both the labeled and unlabeled portions could further improve performance; however, it would also increase the training time and sustainability impact. Therefore, we maintain GeoPile at approximately 600,000 images.

\begin{table}
    \caption{GeoPile pretraining dataset ablation. We remove each dataset individually from GeoPile and report the number of images remaining and the resulting ARP. The row ``w/o curated datasets'' removes all data other than NAIP imagery.}
    \label{tab:data_ablation}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \begin{tabular}{ccc}
        \toprule
        Data & \# Images & ARP $\uparrow$ \\
        \toprule
       
       
       
       
       
       
        w/o WHU-RSD46 & 444,061 & 2.87\\
        w/o MLRSNet & 451,793 & 3.30\\
        w/o Resisc45 & 529,454 & 2.72\\
        w/o PatternNet & 557,554 & 2.98\\
        w/o curated datasets & 300,000 & 1.62\\
        w/o NAIP & 260,954 & 2.65\\
       
        \bottomrule
    \end{tabular}
\end{table}

\subsection{Continual Pretraining Comparison}
In Table \ref{tab:init_ablation}, we compare our training paradigm with the vanilla continual pretraining approach of initializing the model with ImageNet-22k weights before pretraining on GeoPile. We find that this initialization improves performance over training from scratch, which validates the effectiveness of continual pretraining even in its simplest form. However, the performance remains limited despite significant computation. In contrast, our multi-objective pretraining paradigm significantly improves the overall performance with minimal computational needs and carbon impact.
\begin{table}
    \caption{Continual pretraining comparison. In the first two rows, we experiment with simply initializing the model with ImageNet-22k weights prior to conducting MIM training on GeoPile. However, our proposed GFM is both more effective and more efficient.}
    \label{tab:init_ablation}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \begin{tabular}{cccc}
        \toprule
        Method & Epochs & ARP $\uparrow$ & CO2 $\downarrow$\\
        \toprule
       
       
        ImageNet-22k Init. & 200 & 2.66 & 12.64\\
        ImageNet-22k Init. & 800 & 2.98 & 50.56\\
        GFM & 100 & 4.47 & 8.56\\
        \bottomrule
    \end{tabular}
\end{table}
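
As a rough reading of Table \ref{tab:init_ablation}, and assuming its CO2 column is reported in kg eq. CO2, the per-epoch impact of each schedule can be worked out directly from the listed totals. This is only an illustrative calculation on the table values, not an additional measurement:
\begin{equation}
    \frac{12.64}{200} = \frac{50.56}{800} \approx 0.063, \qquad \frac{8.56}{100} \approx 0.086 .
\end{equation}
Under this assumption, a GFM epoch is somewhat more costly than an epoch of initialization-only pretraining, so the saving comes from the much shorter schedule: the total impact of GFM is about $50.56 / 8.56 \approx 5.9\times$ lower than the 800-epoch run and about $12.64 / 8.56 \approx 1.5\times$ lower than the 200-epoch run, while reaching a higher ARP.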

\section{Conclusion}
In summary, this paper investigates a sustainable approach for building geospatial foundation models. To this end, we first construct a concise yet diverse collection of data from various sources for effective pretraining. Second, we propose a multi-objective continual pretraining paradigm, in which we leverage the strong representations of ImageNet-22k to guide and quicken learning, while simultaneously providing the freedom to learn valuable in-domain features through self-supervised learning on geospatial data.
We hope our GFM is one step forward in inspiring the path towards high-performing yet sustainable geospatial foundation models.
{\small
\bibliographystyle{ieee_fullname}
}

\section{Overview}

The supplementary material is organized into the following sections:

\begin{itemize}
    \item Section~\ref{training_details}: Training details for the pretraining stage and all downstream tasks.
    \item Section~\ref{carbon}: Details on training time and calculations of CO2 impact.
    \item Section~\ref{mim_ablation}: Experimental ablation of the GFM multi-objective training.
    \item Section~\ref{temporal}: Temporal pairs experiment in our multi-objective training.
    \item Section~\ref{superres_residual}: Further analysis on the SpaceNet2 super-resolution task.
    \item Section~\ref{broader_impact}: Discussion on the broader impact and limitations of our work.
\end{itemize}

\section{Training Details} \label{training_details}
We provide the training details for the various stages and tasks in our evaluation. \ul{Code and the GeoPile dataset will be made publicly available upon acceptance}.

\textbf{Pretraining}:
We employ 8 NVIDIA A100 GPUs with a batch size of 2048 (128 per GPU) and an image size of 192$\times$192. All pretraining settings are the same as in \cite{simmim}.

\textbf{Change Detection}:
Four NVIDIA A10G GPUs are employed for all downstream tasks. We modify the MMSegmentation \cite{mmseg} framework to conduct our change detection experiments.
For OSCD, as the raw image size is large but the number of samples is very small, we tile the images into 192$\times$192-pixel patches and train for 4k iterations. For DSFIN, we train for 10k iterations with an image size of 512$\times$512. We employ an SGD optimizer with a learning rate of 0.01 and a weight decay of 5.0e-4, together with the default polynomial scheduler of \cite{mmseg}.


\textbf{Classification}:
On UC Merced, we train with a batch size of 1024 (128 per GPU) at image size 256$\times$256. We train for 100 epochs with a base learning rate of 1.0e-4. We employ random flip, crop and standard Mixup \cite{mixup} augmentation. Optimizer, weight decay, Mixup parameters, and other training settings are the same as in \cite{simmim}.
For BigEarthNet, we slightly upscale the original 120$\times$120 images to 128$\times$128 for ease of dimensional compatibility with the Swin transformer. We then employ the same training settings as with UC Merced.

\textbf{Segmentation}:
We employ the MMSegmentation \cite{mmseg} framework to conduct our segmentation experiments. For both datasets, we train for 40k iterations with an image size of 512$\times$512. All other training settings are the same as the default configuration in \cite{mmseg} for the respective backbones (Swin, ViT, ResNet50) and compatible decoders (UPerNet \cite{Upernet} for transformers and DeepLabv3 \cite{deeplab} for ResNets).

\textbf{Super-resolution}:
On the SpaceNet2 super-resolution tasks, we train with a batch size of 64 (16 per GPU) with input image size 160$\times$160 and target size 640$\times$640. We train for 100 epochs with a base learning rate of 1.25e-5. Optimizer, weight decay, and other training settings are the same as in \cite{simmim}, but with no random augmentations. We employ the standard decoder from \cite{simmim} to produce the original input size from the encoder features, and then upscale using a convolution-based upsampling block based on the image reconstruction module for classic super-resolution employed in \cite{swinir}.
Detailed results for all downstream experiments and ablations from the main manuscript are provided in Table \ref{tab:full_results}.
\begin{table*}[t]
    \caption{Ablation results for the training objectives in GFM. For w/o teacher, we only conduct MIM with GeoPile. For w/o MIM, we simply perform the distillation objective from the ImageNet-22k model to our student model with GeoPile.}
    \label{tab:min_ablation}
    \centering
    \setlength\tabcolsep{3.0pt} 
   
   
    \begin{tabular}{cccccccccc}
        \toprule
        Method & OSCD (F1) & DSFIN (F1) & UCM & BEN 10\% & BEN 1\% & WHU & Vai. & SN2 (PSNR) & SN2 (SSIM)\\
        \toprule
        w/o teacher & 57.30 & 67.65 & 98.8 & \textbf{86.5} & 80.0 & 90.5 & 74.0 & 22.509 & 0.631\\
        w/o MIM & 59.58 & \textbf{71.86} & 98.8 & 86.1 & 80.2 & 90.2 & 72.6 & 22.069 & 0.608\\
        \midrule
        GFM & \textbf{59.82} & 71.24 & \textbf{99.0} & 86.3 & \textbf{80.7} & \textbf{90.7} & \textbf{75.3} & \textbf{22.599} & \textbf{0.638}\\
        \bottomrule
    \end{tabular}
\end{table*}
\begin{table*}[t]
    \caption{Results for employing temporal pairs and datasets from SeCo \cite{seco} in our multi-objective pretraining framework. TP indicates that the teacher receives one image from a temporal pair and the student receives the other. SI indicates that the same image is given to both the teacher and the student.}
    \label{tab:data_temp}
    \centering
    \setlength\tabcolsep{2.5pt} 
   
   
    \begin{tabular}{ccccccccccc}
        \toprule
        Dataset & Inputs & OSCD (F1) & DSFIN (F1) & UCM & BEN 10\% & BEN 1\% & WHU & Vai. & SN2 (PSNR) & SN2 (SSIM)\\
        \toprule
        SeCo 100k \cite{seco} & TP & 57.03 & 62.48 & 80.0 & 80.6 & 68.6 & 88.3 & 66.3 & 22.078 & 0.572\\
         SeCo 100k \cite{seco} & SI & 58.41 & 67.92 & 92.1 & 83.9 & 76.5 & 88.8 & 68.1 & 22.439 & 0.602\\
       
        SeCo 1M \cite{seco} & SI & 58.87 & 69.41 & 95.7 & 86.2 & 77.1 & 89.6 & 71.0 & 22.281 & 0.626\\
       
        \midrule
        GeoPile & SI & \textbf{59.82} & \textbf{71.24} & \textbf{99.0} & \textbf{86.3} & \textbf{80.7} & \textbf{90.7} & \textbf{75.3} & \textbf{22.599} & \textbf{0.638}\\
        \bottomrule
    \end{tabular}
\end{table*}

\section{Training Time and Carbon Calculations} \label{carbon}

To calculate the CO2 impact of training various models, we employ the ML CO2 Impact estimator at \url{https://mlco2.github.io/impact#compute} from \cite{co2}. The total impact depends on the hardware type, GPU provider, region, and total time used. Our pretraining experiments were conducted in the AWS US East (Ohio) region, which has a carbon efficiency of 0.57 kg eq. CO2 per kWh. For our GFM, only 7.5 hours of training on 8 A100 GPUs are needed, resulting in a total carbon impact of 8.56 kg eq. CO2. This is significantly lower than that of the previous state-of-the-art geospatial model, SatMAE \cite{satmae}. According to the carbon impact reported in their paper \cite{satmae}, SatMAE requires 109.44 kg eq. CO2 on the Google Cloud Platform us-central1 region, which has a carbon efficiency of 0.57 kg eq. CO2 per kWh (the same as AWS US East (Ohio)). Therefore, \ul{GFM enables a more than 12$\times$ reduction in total carbon impact in comparison to SatMAE.}
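
As a quick check of this ratio, the two totals quoted above can be compared directly. The symbols $C_{\mathrm{SatMAE}}$ and $C_{\mathrm{GFM}}$ are shorthand introduced here purely for illustration:
\begin{equation}
    \frac{C_{\mathrm{SatMAE}}}{C_{\mathrm{GFM}}} = \frac{109.44}{8.56} \approx 12.8 .
\end{equation}
Similarly, if we assume the estimator computes the impact as consumed energy multiplied by the regional carbon efficiency, the GFM total corresponds to roughly $8.56 / 0.57 \approx 15.0$ kWh over $8 \times 7.5 = 60$ GPU-hours, i.e., about $0.25$ kW per GPU on average. This back-of-the-envelope figure is implied by the reported numbers under that assumption rather than measured.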


\section{Multi-objective Ablation} \label{mim_ablation}
To further ablate the performance of GFM, we also experiment with removing the teacher component and MIM component in Table \ref{tab:min_ablation}. We find that the multi-objective approach is the best performer overall. This shows that both the distillation and MIM objectives together are important aspects of efficient and effective geospatial learning. 

\section{Temporal Pairs Experiment} \label{temporal}
Some works employ temporal pairs in the pretraining procedure \cite{seco, gassl, matter}, meaning two satellite images of the same spatial region taken at different times. We also experiment with the use of temporal positives in our training paradigm using the dataset proposed in SeCo \cite{seco}. In this case, the teacher receives one image from a temporal pair, and the student receives the other. The temporal changes could serve as a form of natural augmentation for the distillation objective. However, as shown in Table \ref{tab:data_temp}, we find that using temporal positives (TP) is worse than simply using the same image (SI) for both branches. Therefore, we use the same image for both branches in all other experiments. We further scale up the data by employing the 1M-sample Sentinel-based dataset from SeCo. Nonetheless, GeoPile proves to be more effective as a pretraining data source for our GFM.

\section{Super-resolution with Residual Connection} \label{superres_residual}
In super-resolution tasks, a residual connection can be included from the input to the output stage \cite{swinir}. We make this modification for both ViT and Swin, and present the results in Table \ref{tab:spacenet_residual}. Interestingly, the Swin transformer benefits from this, while ViT does not. Nonetheless, in comparison to the baselines, the conclusion is the same: SatMAE is not able to improve over its ImageNet-22k baseline, whereas GFM is.

\begin{table}
    \caption{SpaceNet2 super-resolution results with the residual connection.}
    \label{tab:spacenet_residual}
    \centering
    \setlength\tabcolsep{5.0pt} 
   
   
    \begin{tabular}{ccc}
        \toprule
        Method & PSNR $\uparrow$ & SSIM $\uparrow$\\
        \toprule
       
       
       
       
       
        ViT (ImageNet-22k) \cite{vit} & 22.548 & 0.629 \\
        SatMAE \cite{satmae} & 22.450 & 0.636 \\
        Swin (random) \cite{swin} & 22.190 & 0.642 \\
        Swin (ImageNet-22k) \cite{swin} & 22.918 & 0.640 \\
       
        \midrule
        GFM & \textbf{22.963} & \textbf{0.660} \\
        \bottomrule
    \end{tabular}
\end{table}

\begin{table*}[t]
    \caption{Detailed downstream results for all experiments in the main manuscript. We use the following abbreviations to save space: UC Merced (UCM), BigEarthNet (BEN), WHU Aerial (WHU), Vaihingen (Vai.), SpaceNet2 (SN2).}
    \label{tab:full_results}
    \centering
    \setlength\tabcolsep{2.5pt} 
   
   
    \begin{tabular}{cccccccccc}
        \toprule
        Method & OSCD (F1) & DSFIN (F1) & UCM & BEN 10\% & BEN 1\% & WHU & Vai. & SN2 (PSNR) & SN2 (SSIM)\\
        \toprule
        ImageNet-22k baseline & 52.35 & 69.62 & 99.0 & 85.7 & 79.5 & 90.4 & 74.7 & 21.655 & 0.612\\
        \midrule
        ImageNet-1k & 57.19 & 67.71 & 97.4 & 85.7 & 78.9 & 89.4 & 73.2 & 22.648 & 0.631\\
        Sentinel-2 & 55.14 & 64.31 & 94.5 & 84.9 & 70.0 & 86.2 & 63.3 & 19.961 & 0.566\\
        GeoPile (200ep) & 56.59 & 68.31 & 98.8 & 86.0 & 79.2 & 89.4 & 73.6 & 22.315 & 0.630\\
        GeoPile (800ep) & 57.30 & 67.65 & 98.8 & 86.5 & 80.0 & 90.5 & 74.0 & 22.509 & 0.631\\
        \midrule
        Stage 1 & 56.20 & 69.79 & 98.1 & 85.8 & 78.3 & 89.0 & 73.3 & 22.153 & 0.626\\
        Stage 2 & 58.97 & 68.27 & 96.9 & 86.1 & 79.0 & 89.4 & 72.2 & 22.409 & 0.625\\
       
        Stage 4 & 60.31 & 68.97 & 98.3 & 86.1 & 80.8 & 89.8 & 73.0 & 22.495 & 0.638\\
        \midrule
        $\alpha$ = 0.1 & 58.98 & 67.44 & 99.0 & 86.0 & 80.6 & 89.7 & 72.2 & 22.213 & 0.633\\
        $\alpha$ = 0.5 & 59.38 & 70.25 & 97.9 & 86.1 & 80.7 & 89.8 & 73.2 & 22.26 & 0.635\\
       
       
       
       
       
        \midrule
        w/o WHU-RSD46 & 58.79 & 69.25 & 98.3 & 86.1 & 80.6 & 89.7 & 72.9 & 22.51 & 0.632\\
        w/o MLRSNet & 60.01 & 69.21 & 98.8 & 86.1 & 80.5 & 89.9 & 72.9 & 22.409 & 0.633\\
        w/o Resisc45 & 58.33 & 69.22 & 98.6 & 86.3 & 80.7 & 89.8 & 72.4 & 22.206 & 0.635\\
        w/o PatternNet & 59.00 & 70.37 & 98.3 & 86.3 & 80.5 & 89.8 & 71.9 & 22.293 & 0.629\\
        w/o curated datasets & 58.49 & 67.16 & 98.1 & 85.7 & 79.9 & 88.9 & 72.7 & 22.852 & 0.584\\
        w/o NAIP & 58.72 & 70.54 & 98.3 & 85.5 & 79.6 & 89.7 & 70.8 & 22.574 & 0.632\\
       
        \midrule
        ImageNet-22k Init. (200ep) & 56.71 & 67.46 & 98.6 & 86.2 & 79.3 & 89.9 & 74.1 & 22.513 & 0.633\\
        ImageNet-22k Init. (800ep) & 57.52 & 66.23 & 98.8 & 86.3 & 79.3 & 90.1 & 75.1 & 22.626 & 0.645\\
        \midrule
       
       
        GFM & 59.82 & 71.24 & 99.0 & 86.3 & 80.7 & 90.7 & 75.3 & 22.599 & 0.638\\
        \bottomrule
    \end{tabular}
\end{table*}

\section{Broader Impact and Limitations} \label{broader_impact}
We anticipate that our GFM approach will serve as an example to inspire other works in investigating sustainable methods for developing geospatial foundation models. 
As the geospatial community continues to innovate, the resulting impact promises to benefit both the earth and society. Automating the process of extracting useful information from geospatial data can help scientists, engineers, and others make data-informed decisions on infrastructure advancement, food supply improvements, and natural disaster response.

A potential limitation of our GFM approach is that it may still be somewhat constrained by the performance of the ImageNet-22k model. If a model were trained from scratch on an extremely large corpus of remote sensing data, it might eventually also outperform ImageNet baselines. However, this would incur a substantial amount of training time and CO2 impact. Furthermore, ImageNet models are constantly being improved and released by the general computer vision community, providing a consistent source of better baseline models. Therefore, our approach enables the geospatial domain to effectively leverage these improvements for better in-domain performance with minimal carbon impact. We believe this is a sustainable way for the geospatial community to continually benefit from the most recent progress in computer vision, enabling a smarter, safer, and healthier planet. 
{\small
\bibliographystyle{ieee_fullname}
