Keywords: Aerial image reasoning and synthesis, Multi-modal conditioning, Spatial discrepancy, Latent diffusion
Abstract: Diffusion models have emerged as powerful generative models for text-to-image synthesis, yet their extension to aerial imagery remains limited due to unique challenges such as high object density and geometric distortions. In this paper, we propose MIND, a multi-scale discrepancy-centric latent diffusion framework designed to address these issues and enable high-fidelity, semantically coherent aerial image synthesis. MIND introduces a theoretically justified method for estimating discrepancy maps that identify semantic and structural inconsistencies during generation, guiding both image synthesis and textual supervision. We incorporate these maps into the generation pipeline via three complementary mechanisms: (1) actor-critic visual reasoning that produces rationale-rich textual guidance using large language models, (2) discrepancy-augmented latent representation learning for spatial refinement, and (3) adaptive denoising that dynamically attends to hard-to-learn regions. Extensive experiments on VisDrone-DET and DroneRGBT demonstrate that MIND significantly outperforms state-of-the-art baselines in visual quality, spatial alignment, and text-image consistency, establishing a strong foundation for structured and controllable aerial image synthesis.
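The abstract does not give implementation details, but mechanism (3) admits a simple illustration: below is a minimal sketch of one plausible way a discrepancy map could re-weight a latent-diffusion denoising loss toward hard-to-learn regions. The function name, the additive weighting scheme, and the tensor shapes are all assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def discrepancy_weighted_loss(noise_pred: torch.Tensor,
                              noise_true: torch.Tensor,
                              discrepancy_map: torch.Tensor,
                              base_weight: float = 1.0) -> torch.Tensor:
    """Per-pixel denoising loss re-weighted by a discrepancy map.

    Hypothetical sketch: regions flagged as inconsistent (higher
    discrepancy) receive a larger weight, so training attends more to
    hard-to-learn areas. Tensors are (B, C, H, W); the map is assumed
    single-channel and normalized to [0, 1].
    """
    # Upweight each pixel in proportion to its estimated discrepancy;
    # the (B, 1, H, W) map broadcasts across latent channels.
    weight = base_weight + discrepancy_map
    per_pixel = F.mse_loss(noise_pred, noise_true, reduction="none")
    return (weight * per_pixel).mean()

# Toy usage with random tensors standing in for denoiser outputs.
pred = torch.randn(2, 4, 32, 32)   # predicted noise in latent space
true = torch.randn(2, 4, 32, 32)   # ground-truth noise
dmap = torch.rand(2, 1, 32, 32)    # estimated discrepancy map
loss = discrepancy_weighted_loss(pred, true, dmap)
```

With `base_weight = 1.0` the loss reduces to the standard uniform denoising objective when the discrepancy map is all zeros, so the re-weighting acts purely as an additive emphasis on flagged regions.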
Primary Area: generative models
Submission Number: 8145