Learning from Synthetic Data for Visual Grounding

Published: 06 May 2025, Last Modified: 29 May 2025
Venue: SynData4CV
License: CC BY 4.0
Keywords: Visual Grounding, Synthetic Data, Learning from Models
Abstract:

This paper extensively investigates the effectiveness of synthetic training data in improving the visual grounding capabilities of vision-and-language models. We explore various strategies for generating image-text pairs and image-text-box triplets using a series of pretrained models. Through comparative analyses with synthetic, real, and web-crawled data, we identify the factors that drive performance differences and propose SynGround, an effective pipeline for generating useful synthetic data for visual grounding. We show that data generated with SynGround improves the pointing game accuracy of pretrained ALBEF and BLIP models by 4.81 and 17.11 absolute percentage points, respectively, across the RefCOCO+ and Flickr30k benchmarks.
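The abstract describes the data-generation pipeline only at a high level. The sketch below illustrates one hypothetical way to assemble image-text-box triplets from off-the-shelf pretrained models (a text-to-image generator, an image captioner, and an open-vocabulary detector); the specific checkpoints, the spaCy-based phrase extraction, and the score threshold are illustrative assumptions, not the components reported in the paper.

```python
# Hypothetical sketch: producing synthetic image-text-box triplets for visual
# grounding from off-the-shelf pretrained models. The chosen checkpoints and
# the noun-chunk phrase extraction are assumptions for illustration only.

import spacy
import torch
from diffusers import StableDiffusionPipeline
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) A text-to-image generator produces a synthetic image from a caption.
generator = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1"
).to(device)

# 2) A captioner yields an image-text pair by re-captioning the synthetic image.
captioner = pipeline(
    "image-to-text",
    model="Salesforce/blip-image-captioning-base",
    device=0 if device == "cuda" else -1,
)

# 3) An open-vocabulary detector localizes caption phrases, giving boxes.
detector = pipeline(
    "zero-shot-object-detection",
    model="google/owlvit-base-patch32",
    device=0 if device == "cuda" else -1,
)

nlp = spacy.load("en_core_web_sm")  # pulls candidate phrases from the caption


def synthesize_triplets(seed_caption: str, score_threshold: float = 0.3):
    """Generate one synthetic image plus (phrase, box) annotations for it."""
    image = generator(seed_caption).images[0]

    caption = captioner(image)[0]["generated_text"]
    phrases = [chunk.text for chunk in nlp(caption).noun_chunks] or [caption]

    detections = detector(image, candidate_labels=phrases)
    boxes = [
        {"phrase": d["label"], "box": d["box"], "score": d["score"]}
        for d in detections
        if d["score"] >= score_threshold
    ]
    return {"image": image, "caption": caption, "triplets": boxes}


if __name__ == "__main__":
    sample = synthesize_triplets("two dogs playing with a red frisbee in a park")
    print(sample["caption"])
    for t in sample["triplets"]:
        print(t["phrase"], t["box"], round(t["score"], 3))
```

The resulting triplets could then serve as grounding supervision for a vision-and-language model; which generator, captioner, and detector are actually used, and how they are chained, is what the paper's comparative analyses evaluate.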

Submission Number: 30