Abstract: The development of foundation models through pretraining of vision-language models (VLMs) has recently attracted great attention in remote sensing (RS). VLM pretraining aims to learn image and language alignments from a large number of image-text pairs. Each pretraining image is often associated with multiple captions that contain redundant information due to repeated or semantically similar phrases, which increases the pretraining and inference time of VLMs. To overcome this, we introduce a weighted feature aggregation (WFA) strategy for VLM pretraining in RS. Our strategy aims to extract and exploit complementary information from multiple captions per image, while reducing redundancies through feature aggregation with importance weighting. To calculate adaptive importance weights for the different captions of each image, we propose two techniques: i) non-parametric uniqueness; and ii) learning-based attention. In the first technique, importance weights are calculated based on the bilingual evaluation understudy (BLEU) scores of the captions to emphasize unique sentences while suppressing the influence of repetitive sentences. In the second technique, importance weights are learned through an attention mechanism instead of relying on hand-crafted features. The effectiveness of the proposed WFA strategy with the two techniques is analyzed in terms of downstream performance on text-to-image retrieval in RS. Experimental results show that the proposed strategy enables efficient and effective pretraining of VLMs in RS. Based on the experimental analysis, we derive guidelines for the proper selection of techniques, considering downstream task requirements and resource constraints. The code of this work is publicly available at https://git.tu-berlin.de/rsim/redundacy-aware-rs-vlm.
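As a toy illustration of the non-parametric uniqueness idea, the sketch below derives per-caption weights from pairwise BLEU similarity. This is an assumption-laden reconstruction from the abstract alone: the paper's exact BLEU configuration (n-gram order, smoothing) and weight normalization are not specified here, so a simple smoothed sentence-level BLEU and sum-to-one normalization are used for illustration.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Simple smoothed sentence-level BLEU (up to bigrams) with brevity penalty.
    Illustrative stand-in; the paper's exact BLEU setup is not given in the abstract."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_counts = ngram_counts(cand, n)
        r_counts = ngram_counts(ref, n)
        overlap = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = sum(c_counts.values())
        log_prec += math.log((overlap + 1) / (total + 1))  # add-one smoothing
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)

def uniqueness_weights(captions):
    """Weight each caption by 1 - mean BLEU against the other captions of the
    same image, then normalize to sum to one: near-duplicate captions receive
    low weight, unique captions high weight."""
    raw = []
    for i, ci in enumerate(captions):
        sims = [bleu(ci, cj) for j, cj in enumerate(captions) if j != i]
        raw.append(1.0 - sum(sims) / len(sims))
    total = sum(raw) or 1.0
    return [w / total for w in raw]
```

For example, given two near-duplicate captions ("an airport with several planes on the runway" vs. "... several airplanes ...") and one distinct caption ("dense residential area with small houses"), the distinct caption receives the largest weight, so its features dominate the aggregation.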