DiScan: A Quad-Directional SSM Diffusion Framework

19 Sept 2025 (modified: 15 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: State Space Models, Text-to-Image Synthesis, Multi-directional Scanning, Retrieval-Augmented Generation, Spectral-Spatial Fusion, Wavelet Decomposition, Cross-Modal Alignment
Abstract: Text-to-image synthesis models often suffer from texture blurring, shape distortion, and poor alignment with textual prompts. These issues stem from limited spatial modeling, weak cross-modal interaction, and insufficient detail preservation. To address them, we propose DiScan, a framework that combines directional state-space modeling with retrieval-based fusion for efficient, high-fidelity synthesis. First, we introduce a quad-directional SSM that jointly scans visual and textual features along four directions; it shares state dynamics across directions for parameter efficiency and uses direction-specific projections to enhance spatial coherence and semantic consistency (a schematic sketch of this scan appears below). Second, we design a dual-stage attention module that exploits retrieved references: the first stage aligns prompt and image features via cross-attention, and the second modulates features through direction-aware scanning to improve structure preservation. Third, we propose a spatial-frequency fusion block that combines wavelet decomposition with bidirectional scanning to capture fine textures and enhance local details. Extensive experiments show that DiScan outperforms Zigma and USM, achieving significant FID improvements (+6.3 on CelebA-HQ, +9.95 on COCO, +0.67 on CIFAR-10) while maintaining excellent visual quality. Our work establishes directional SSM diffusion as a scalable paradigm for efficient high-fidelity synthesis.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17752
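
The quad-directional SSM described in the abstract can be pictured as scanning a 2D feature map in four orders (left-to-right, right-to-left, column-wise, reversed column-wise) with one shared recurrence and per-direction projections. Since only the abstract is available here, the PyTorch sketch below is an illustrative assumption of that idea rather than the authors' implementation: it substitutes a plain GRU for the Mamba-style SSM recurrence, and every name in it (`QuadDirectionalScan`, `flatten_directions`, `in_proj`, `out_proj`, `shared_rnn`) is hypothetical.

```python
# Minimal sketch of a quad-directional scan, assuming a GRU stands in for the SSM recurrence.
# All module and parameter names are illustrative, not taken from the DiScan paper.
import torch
import torch.nn as nn


def flatten_directions(x):
    """Flatten a (B, C, H, W) feature map into four 1-D scan orders:
    row-major, reversed row-major, column-major, reversed column-major."""
    rows = x.flatten(2)                      # (B, C, H*W), scanned row by row
    cols = x.transpose(2, 3).flatten(2)      # (B, C, W*H), scanned column by column
    return [rows, rows.flip(-1), cols, cols.flip(-1)]


class QuadDirectionalScan(nn.Module):
    """Four directional scans sharing one recurrence (parameter efficiency),
    each with its own input/output projection (direction-specific projections)."""

    def __init__(self, dim, hidden=None):
        super().__init__()
        hidden = hidden or dim
        self.shared_rnn = nn.GRU(dim, hidden, batch_first=True)               # shared dynamics
        self.in_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(4)])
        self.out_proj = nn.ModuleList([nn.Linear(hidden, dim) for _ in range(4)])

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        outs = []
        for d, seq in enumerate(flatten_directions(x)):
            s = self.in_proj[d](seq.transpose(1, 2))         # (B, L, C), direction-specific input
            y, _ = self.shared_rnn(s)                        # shared sequential scan
            outs.append(self.out_proj[d](y))                 # (B, L, C), direction-specific output
        rows_f, rows_b, cols_f, cols_b = outs
        # Undo the reversed / column orderings, then fuse the four directions by summation.
        rows = rows_f + rows_b.flip(1)
        cols = (cols_f + cols_b.flip(1)).view(b, w, h, c).transpose(1, 2).reshape(b, h * w, c)
        return (rows + cols).transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 16, 16)                       # toy latent feature map
    print(QuadDirectionalScan(64)(feats).shape)              # torch.Size([2, 64, 16, 16])
```

In this sketch, the single `shared_rnn` mirrors the abstract's parameter-sharing claim, while the four `in_proj`/`out_proj` pairs play the role of the direction-specific projections; how DiScan actually fuses the four directional outputs is not specified in the abstract, so the summation here is an assumption.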