CoSimGen: Controllable diffusion model for simultaneous image and segmentation mask generation

07 May 2025 (modified: 29 Oct 2025) | Submitted to NeurIPS 2025 | CC BY-NC-SA 4.0
Keywords: Generative AI, diffusion model, segmentation dataset generation, image-mask generation, inception distance metrics
TL;DR: A diffusion model for simultaneous generation of image-mask pairs, controllable by either class labels or text prompts.
Abstract: Generating paired images and segmentation masks remains a core bottleneck in data-scarce domains such as medical imaging and remote sensing, where manual annotation is expensive, expertise-dependent, and ethically constrained. Existing generative approaches typically handle image or mask generation in isolation and offer limited control over spatial and semantic outputs. We introduce CoSimGen, a diffusion-based framework for the controllable, simultaneous generation of images and segmentation masks. CoSimGen integrates multi-level conditioning via (1) class-grounded textual prompts that allow input controls to be hot-swapped, (2) spatial embeddings for contextual coherence, and (3) spectral timestep embeddings for denoising control. To enforce alignment and generation fidelity, we combine a contrastive triplet loss between text and class embeddings with diffusion and adversarial objectives. Low-resolution outputs ($128\times128$) are super-resolved to $512\times512$, ensuring high-fidelity synthesis. Evaluated across five diverse datasets, CoSimGen achieves state-of-the-art performance in FID, KID, LPIPS, and Semantic-FID, with KID as low as 0.11 and LPIPS of 0.53. Our method enables scalable, controllable dataset generation and advances multimodal generative modeling in structured prediction tasks.
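As a rough sketch of the training objective described in the abstract, the losses could be combined as a weighted sum; the weighting coefficients $\lambda_{\text{adv}}$ and $\lambda_{\text{tri}}$ and the exact form of each term are assumptions, not taken from the paper:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{diff}} + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}} + \lambda_{\text{tri}}\,\mathcal{L}_{\text{triplet}}\big(e_{\text{text}}, e_{\text{class}}\big),$$
where $\mathcal{L}_{\text{diff}}$ denotes the denoising diffusion objective over the joint image-mask output, $\mathcal{L}_{\text{adv}}$ an adversarial term, and $\mathcal{L}_{\text{triplet}}$ a contrastive triplet loss aligning the text embedding $e_{\text{text}}$ with its corresponding class embedding $e_{\text{class}}$.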
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 8587