This paper extensively investigates the effectiveness of synthetic training data for improving the visual grounding capabilities of vision-and-language models. We explore strategies for generating image-text pairs and image-text-box triplets with a series of pretrained models. Through comparative analyses of synthetic, real, and web-crawled data, we identify the factors that drive performance differences and propose SynGround, an effective pipeline for generating useful synthetic data for visual grounding. We show that data generated with SynGround improves the pointing game accuracy of pretrained ALBEF and BLIP models by 4.81 and 17.11 absolute percentage points, respectively, across the RefCOCO+ and Flickr30k benchmarks.
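The triplet-generation idea described above (caption, synthesized image, box annotation) can be illustrated with off-the-shelf pretrained models. The sketch below is a minimal, hypothetical example assuming a Stable Diffusion text-to-image model and an OWL-ViT open-vocabulary detector; these are illustrative stand-ins, not necessarily the components used by SynGround.

```python
# Hypothetical sketch: generate one synthetic image-text-box triplet
# using off-the-shelf pretrained models (illustrative assumptions only).
import torch
from diffusers import StableDiffusionPipeline
from transformers import OwlViTProcessor, OwlViTForObjectDetection

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Synthesize an image from a caption with a text-to-image model.
caption = "a brown dog sitting next to a red bicycle on a sidewalk"
t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
image = t2i(caption).images[0]

# 2. Extract boxes for phrases in the caption with an open-vocabulary detector,
#    turning the image-text pair into image-text-box triplets.
phrases = ["a brown dog", "a red bicycle"]
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
detector = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32").to(device)

inputs = processor(text=[phrases], images=image, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = detector(**inputs)

target_sizes = torch.tensor([image.size[::-1]], device=device)  # (height, width)
detections = processor.post_process_object_detection(
    outputs=outputs, threshold=0.2, target_sizes=target_sizes
)[0]

# 3. Keep the highest-scoring box per phrase as the synthetic grounding label.
triplets = []
for phrase_idx, phrase in enumerate(phrases):
    mask = detections["labels"] == phrase_idx
    if mask.any():
        best = detections["scores"][mask].argmax()
        box = detections["boxes"][mask][best].tolist()  # [x0, y0, x1, y1]
        triplets.append({"image": image, "phrase": phrase, "box": box})
```

Triplets produced this way could then serve as supervision for fine-tuning a grounding model such as ALBEF or BLIP; the phrase list here is hand-written for clarity, whereas a full pipeline would derive it automatically from the generated caption.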