Less is More? Data Specialization for Self-Supervised Remote Sensing Models

Published: 10 Jun 2025, Last Modified: 17 Jul 2025 · TerraBytes 2025 (without proceedings) · CC BY 4.0
Keywords: remote sensing, data specialization, data filtering, self-supervised learning
TL;DR: We improve the downstream performance of self-supervised remote sensing models by smartly removing most of the pretraining images.
Abstract: Recent foundation models for natural images, such as DINOv2, emphasize data curation as a critical component of the pretraining pipeline. These approaches typically aim to remove near-duplicate images and address semantic imbalance by applying clustering techniques to image representations extracted from pretrained models. While prior work on data curation primarily focuses on reducing computational cost while maintaining model quality, in this study we investigate data specialization, that is, whether reducing dataset size can improve model quality under a compute-controlled setting. We experiment with two remote sensing datasets, Million-AID and Maxar, apply two data pruning techniques to obtain smaller subsets, and pretrain self-supervised iBOT models while keeping the compute budget constant. We evaluate our models using k-NN on three remote sensing tasks. We show that filtering by hierarchical clustering improves the transfer of Maxar pretraining by 3 percentage points while removing 98.5% of the dataset. In contrast, neither filtering method improves the transfer of Million-AID pretraining. This motivates future work on identifying and removing "distracting" inputs from pretraining datasets to improve downstream performance.
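For illustration only, the sketch below shows one way clustering-based pruning of precomputed image embeddings could be implemented. It is not the paper's actual pipeline: the function name, cluster count, and representative-selection rule are assumptions.

```python
# Hypothetical sketch: prune a pretraining set by hierarchical clustering of
# precomputed image embeddings, keeping one representative image per cluster.
# This removes near-duplicates and rebalances semantic content; the specific
# choices below (ward linkage, centroid-nearest representative) are assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def prune_by_hierarchical_clustering(embeddings: np.ndarray, n_clusters: int) -> np.ndarray:
    """Return indices of one representative image per cluster."""
    labels = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward").fit_predict(embeddings)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        centroid = embeddings[idx].mean(axis=0)
        # Keep the cluster member whose embedding is closest to the centroid.
        keep.append(idx[np.argmin(np.linalg.norm(embeddings[idx] - centroid, axis=1))])
    return np.array(sorted(keep))

# Example usage (hypothetical file): keep ~1.5% of a 100k-image collection,
# roughly the scale of reduction reported in the abstract.
# subset_indices = prune_by_hierarchical_clustering(np.load("embeddings.npy"), n_clusters=1500)
```

A design note on this sketch: agglomerative clustering with ward linkage scales quadratically in memory, so for million-scale collections one would typically cluster in shards or on a subsample.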
Submission Number: 43