Abstract: Generative models are widely used to produce synthetic images with annotations, alleviating the burden of image collection and annotation for training deep visual models. However, challenges such as limited image diversity, noisy pseudo labels, and domain gaps between synthetic and real images often undermine their effectiveness in downstream visual tasks. This paper introduces the Iterative Self-Training with Class-Aware Text-to-Image Synthesis (IST-CATS) framework, which addresses these challenges by integrating a class-aware text-to-image synthesis (CATS) component with an iterative self-training (IST) strategy. CATS introduces a novel class-aware chain approach to generate detailed descriptions, which serve as prompts for a diffusion model, enabling the creation of a diverse set of images whose objects are clearly distinguishable from the background. The generated images can be readily pseudo-labeled by an unsupervised instance segmentation method, and the resulting noisy pseudo labels are then purified by a novel feature similarity-based filtering mechanism. These generated images underpin IST, which progressively enhances vision models and refines pseudo labels through self-training and our proposed label filtering strategy (LabFilt). LabFilt improves pseudo-label quality by applying class-adaptive filtering at both the pixel and object levels. IST-CATS demonstrates superior performance in object detection and semantic segmentation compared to traditional synthetic-data and semi/weakly-supervised methods, effectively addressing data collection and annotation challenges.
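To make the overall pipeline concrete, the following is a minimal structural sketch of the iterative self-training loop with class-adaptive label filtering described in the abstract. All names here (`PseudoSample`, `labfilt`, `iterative_self_training`, the per-class mean-confidence threshold) are hypothetical illustrations, not the paper's actual implementation: the simple confidence filter stands in for the paper's pixel- and object-level LabFilt, and model retraining is simulated rather than performed.

```python
# Hypothetical sketch of an IST-style loop: filter pseudo labels with a
# class-adaptive criterion, then "retrain" and repeat. Not the paper's code.
from dataclasses import dataclass
import random


@dataclass
class PseudoSample:
    image_id: int
    class_id: int
    confidence: float  # model confidence for this pseudo label


def labfilt(samples, num_classes):
    """Class-adaptive filtering (assumed form): keep samples whose confidence
    exceeds a per-class threshold, here the mean confidence of that class,
    so each class gets its own acceptance criterion."""
    kept = []
    for c in range(num_classes):
        group = [s for s in samples if s.class_id == c]
        if not group:
            continue
        thresh = sum(s.confidence for s in group) / len(group)
        kept.extend(s for s in group if s.confidence >= thresh)
    return kept


def iterative_self_training(initial_samples, num_classes, rounds=3):
    samples = initial_samples
    for r in range(rounds):
        samples = labfilt(samples, num_classes)  # purify pseudo labels
        # A real implementation would retrain the detector/segmenter on the
        # filtered set here; we simulate improvement by nudging confidences.
        samples = [
            PseudoSample(s.image_id, s.class_id, min(1.0, s.confidence + 0.05))
            for s in samples
        ]
        print(f"round {r}: {len(samples)} pseudo-labeled samples retained")
    return samples


if __name__ == "__main__":
    random.seed(0)
    data = [PseudoSample(i, random.randrange(3), random.random()) for i in range(100)]
    iterative_self_training(data, num_classes=3)
```

The design point the sketch captures is that filtering and retraining alternate: each round's cleaner label set yields a stronger model, whose more confident predictions in turn survive the next round's class-adaptive filter.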