Keywords: Diffusion Models for Vision, Semantic Scene Generation, Dataset Generation, Monocular SSC
Abstract: 3D semantic scene synthesis using discrete diffusion models faces severe challenges due to extreme class imbalance, where background voxels vastly outnumber foreground objects. This imbalance becomes particularly problematic in discrete diffusion for two reasons: (1) the denoising process operates in probability space rather than feature space, making minority classes vulnerable to majority absorption, and (2) learned transition probabilities exhibit systematic bias toward backgrounds, which compounds across diffusion steps and causes irreversible loss of foreground information. We identify this phenomenon as \textit{probabilistic flow collapse}---a fundamental limitation of existing methods. To address it, we propose the Compositional Discrete Denoising Diffusion Probabilistic Model (Comp-D3PM), which synthesizes 3D scenes by compositionally denoising foreground and background voxels through separate transition dynamics. Our contributions are threefold: (1) we formally characterize probabilistic flow collapse and introduce a two-stream architecture that prevents minority-class absorption through compositional modeling; (2) building on this architecture, we enable a range of applications, including the generation of image–semantic scene datasets; and (3) we demonstrate on CarlaSC and SemanticKITTI that Comp-D3PM produces significantly more realistic and diverse scenes while preserving semantic integrity.
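To make the idea of "separate transition dynamics" concrete, the following is a minimal sketch (not the authors' implementation) of one forward corruption step in which foreground and background voxels are noised by different transition matrices, so background probability mass cannot absorb minority foreground classes. The class count, the background class index, the uniform-kernel construction, and all function names here are illustrative assumptions rather than details taken from the paper.

```python
# Hedged sketch: compositional forward noising step for a discrete diffusion
# model over voxel labels. All constants and names below are assumptions.
import torch

NUM_CLASSES = 11          # assumed label-set size (e.g. a CarlaSC-style taxonomy)
BG_CLASSES = (0,)         # assumed: class 0 = free space / background

def uniform_transition(num_classes, beta):
    """Standard D3PM-style uniform kernel: Q = (1 - beta) * I + beta * U."""
    eye = torch.eye(num_classes)
    return (1.0 - beta) * eye + beta * torch.full_like(eye, 1.0 / num_classes)

def compositional_forward_step(x_onehot, beta_fg, beta_bg, fg_mask):
    """Corrupt foreground and background voxels with separate transition matrices.

    x_onehot: (..., num_classes) one-hot (or probability) voxel labels
    fg_mask:  (...,) boolean mask marking foreground voxels
    """
    q_fg = uniform_transition(NUM_CLASSES, beta_fg)
    q_bg = uniform_transition(NUM_CLASSES, beta_bg)
    probs_fg = x_onehot @ q_fg
    probs_bg = x_onehot @ q_bg
    probs = torch.where(fg_mask[..., None], probs_fg, probs_bg)
    # Sample corrupted labels from the per-voxel categorical distribution.
    return torch.distributions.Categorical(probs=probs).sample()

# Toy usage on a 4x4x4 grid: foreground voxels receive a smaller noise rate.
labels = torch.randint(0, NUM_CLASSES, (4, 4, 4))
x = torch.nn.functional.one_hot(labels, NUM_CLASSES).float()
fg_mask = ~torch.isin(labels, torch.tensor(BG_CLASSES))
noisy = compositional_forward_step(x, beta_fg=0.05, beta_bg=0.2, fg_mask=fg_mask)
```

In this toy setup, the separation simply means each voxel stream follows its own corruption schedule; the paper's two-stream reverse (denoising) dynamics would mirror this split rather than the single shared kernel used in standard discrete diffusion.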
Supplementary Material: zip
Primary Area: generative models
Submission Number: 6920