CC-Diff++: Spatially Controllable Text-to-Image Synthesis for Remote Sensing With Enhanced Contextual Coherence

Published: 01 Jan 2025 · Last Modified: 04 Nov 2025 · IEEE Trans. Geosci. Remote Sens. 2025 · CC BY-SA 4.0
Abstract: Generating visually realistic remote sensing (RS) images requires maintaining semantic coherence between objects and their surrounding environments. However, existing image synthesis methods prioritize foreground controllability while oversimplifying backgrounds into plain or generic textures. This oversight neglects the crucial interaction between foreground and background elements, resulting in semantic inconsistencies in RS scenarios. To address this challenge, we propose CC-Diff++, a diffusion model-based approach for spatially controllable RS image synthesis with enhanced contextual coherence. To capture spatial interdependence, we design a novel module, the Co-Resampler, which employs an advanced masked attention mechanism to jointly extract features from the foreground and background while modeling their mutual relationships. Furthermore, we introduce a text-to-layout prediction module powered by large language models (LLMs) and a reference image retrieval mechanism that provides rich textural guidance; together, these enable CC-Diff++ to generate outputs that are both more diverse and more realistic. Extensive experiments demonstrate that CC-Diff++ outperforms state-of-the-art methods in visual fidelity, semantic accuracy, and positional precision on multiple RS datasets. CC-Diff++ also yields effective synthetic training data, improving detection accuracy by 2.04 mAP on the DOTA dataset and 11.81 mAP on the HRSC dataset.
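The abstract does not include implementation details, but the Co-Resampler's masked attention idea can be illustrated with a minimal PyTorch sketch. The sketch below is one plausible reading, not the authors' implementation: learnable foreground and background query tokens perform masked cross-attention restricted to their own region's features, then a self-attention step couples the two query groups to model foreground-background interdependence. All names (`CoResamplerSketch`, `fg_q`, `bg_q`) and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn


class CoResamplerSketch(nn.Module):
    """Hypothetical sketch of a joint FG/BG resampler (not the paper's code).

    Step 1: masked cross-attention -- FG queries read only FG tokens,
            BG queries read only BG tokens.
    Step 2: self-attention over all queries -- the two groups exchange
            information, modeling FG-BG mutual relationships.
    """

    def __init__(self, dim: int = 256, n_queries: int = 16, n_heads: int = 8):
        super().__init__()
        self.fg_q = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.bg_q = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.cross = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, fg_feats: torch.Tensor, bg_feats: torch.Tensor):
        # fg_feats: (B, N_fg, dim); bg_feats: (B, N_bg, dim)
        B, n_fg, _ = fg_feats.shape
        n_bg = bg_feats.shape[1]
        kv = torch.cat([fg_feats, bg_feats], dim=1)

        q = torch.cat([self.fg_q, self.bg_q], dim=0)  # (2Q, dim)
        q = q.unsqueeze(0).expand(B, -1, -1)          # (B, 2Q, dim)
        n_q = q.shape[1]

        # Boolean attention mask: True = position is NOT attended to.
        mask = torch.zeros(n_q, n_fg + n_bg, dtype=torch.bool,
                           device=kv.device)
        mask[: n_q // 2, n_fg:] = True   # FG queries ignore BG tokens
        mask[n_q // 2:, :n_fg] = True    # BG queries ignore FG tokens

        tokens, _ = self.cross(self.norm1(q), kv, kv, attn_mask=mask)
        # Mutual modeling: FG and BG query groups attend to each other,
        # coupling object features with their surrounding context.
        mixed, _ = self.self_attn(self.norm2(tokens), tokens, tokens)
        return tokens + mixed            # (B, 2Q, dim) conditioning tokens


if __name__ == "__main__":
    model = CoResamplerSketch()
    fg = torch.randn(2, 32, 256)   # e.g. pooled object-region features
    bg = torch.randn(2, 64, 256)   # e.g. background-region features
    print(model(fg, bg).shape)     # torch.Size([2, 32, 256])
```

The resulting tokens would serve as conditioning inputs to the diffusion backbone; the split into a masked stage and a mixing stage is one design choice consistent with "jointly extract features while modeling mutual relationships," not a confirmed detail of CC-Diff++.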