SatViT: Pretraining Transformers for Earth Observation

Published: 01 Jan 2022, Last Modified: 12 May 2023 | IEEE Geosci. Remote. Sens. Lett. 2022
Abstract: Despite the enormous success of the “pretraining and fine-tuning” paradigm across machine learning, it has yet to pervade remote sensing (RS). To help rectify this, we pretrain a vision transformer (ViT) on 1.3 million satellite-derived RS images. We pretrain SatViT using a state-of-the-art (SOTA) self-supervised learning (SSL) algorithm called masked autoencoding (MAE), which learns general representations by reconstructing held-out image patches. Crucially, this approach does not require annotated data, allowing us to pretrain on unlabeled images acquired from Sentinel-1 and Sentinel-2. After fine-tuning, SatViT outperforms SOTA ImageNet and RS-specific pretrained models on both of our downstream tasks. We further improve overall accuracy (OA) by 3.2% and 0.21% by continuing to pretrain SatViT, still using MAE, on the unlabeled target datasets. Most importantly, we release our code, pretrained model weights, and tutorials aimed at helping researchers fine-tune our models (https://github.com/antofuller/SatViT).
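
For readers unfamiliar with MAE-style pretraining, the sketch below illustrates the held-out-patch idea described in the abstract. It is a minimal illustration in PyTorch, not the authors' implementation: the 75% mask ratio, patch count, and embedding dimension are standard MAE defaults assumed here; see the linked repository for SatViT's actual code and pretrained weights.

```python
import torch

def random_mask_patches(patch_tokens, mask_ratio=0.75):
    """Randomly hold out a fraction of patch tokens (MAE-style masking).

    patch_tokens: (batch, num_patches, dim) tensor of embedded image patches.
    Returns the visible tokens plus the indices needed to restore patch order,
    so a decoder can later reconstruct the held-out patches.
    """
    batch, num_patches, dim = patch_tokens.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    # A random permutation per sample decides which patches stay visible.
    noise = torch.rand(batch, num_patches, device=patch_tokens.device)
    shuffle_idx = torch.argsort(noise, dim=1)
    keep_idx = shuffle_idx[:, :num_keep]

    # Gather only the visible patches; the encoder never sees the rest.
    visible = torch.gather(
        patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim)
    )
    return visible, keep_idx, shuffle_idx

# Example: 196 patches (14x14 grid), 75% masked -> 49 visible tokens.
tokens = torch.randn(2, 196, 768)
visible, keep_idx, shuffle_idx = random_mask_patches(tokens)
print(visible.shape)  # torch.Size([2, 49, 768])
```

In MAE, only the visible tokens are passed through the ViT encoder, and a lightweight decoder reconstructs the masked patches, which is why no annotations are required during pretraining.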