InstSynth: Instance-wise Prompt-guided Style Masked Conditional Data Synthesis for Scene Understanding

Thanh-Danh Nguyen, Bich-Nga Pham, Trong-Tai Dam Vu, Vinh-Tiep Nguyen, Thanh Duc Ngo, Tam V. Nguyen

Published: 01 Jan 2024, Last Modified: 11 Jun 2025, MAPR 2024, CC BY-SA 4.0

Abstract: Instance-level scene understanding is an essential computer vision task that supports modern Advanced Driver Assistance Systems. Existing solutions rely on abundant annotated training data, but instance-level annotation is costly because it requires extensive manual effort. In this work, we address this problem by introducing InstSynth, a framework that leverages instance-wise annotations as conditions to enrich the training data. Existing methods focus on semantic segmentation and use prompts to synthesize image-annotation pairs, but the synthesized results tend to be unrealistic. Our proposals harness the strength of large generative models to synthesize instance data with prompt-guided and mask-based mechanisms, boosting the performance of instance-level scene understanding models. Empirically, we improve the performance of the recent instance segmentation architectures FastInst and OneFormer by 14.49% and 11.59% AP, respectively, on the Cityscapes benchmark. We also construct an instance-level synthesized dataset, dubbed IS-Cityscapes, containing over 4× more instances than the vanilla Cityscapes. Code can be found at https://github.com/danhntd/InstSynth.