Abstract: While recent methods have achieved impressive results in 3D reconstruction, they typically rely on dense multi-view inputs and often struggle with ambiguity in occluded or unobserved regions, particularly in complex scene layouts and background areas. We propose Learning to See the Unseen (LSU), a unified framework for high-fidelity 3D scene reconstruction from sparse or even single-image inputs by coupling generative novel-view synthesis with Gaussian-based scene reconstruction. Our approach introduces a Scene Diffusion Module (SDM) that conditions on sparse views and text prompts to synthesize consistent novel views. To improve spatial alignment across generated views, SDM incorporates a scene-level geometric supervision strategy that constrains the diffusion process using 3D structural consistency. Additionally, we design a geometry-aware Gaussian reconstruction module that leverages depth and surface normal priors to refine the reconstructed scene, improving geometric accuracy, background coherence, and rendering fidelity. Extensive experiments demonstrate that LSU achieves state-of-the-art performance on the RealEstate10K dataset and generalizes effectively to unseen domains, including KITTI and Mip-NeRF, recovering accurate global geometry while preserving fine-grained visual details across diverse scenes.
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
## Revision Summary
**Appendix C**
- Added a qualitative comparison with ReconX and MVGD.
**Appendix E**
- Moved limitations and future work from Section 6 of the main paper.
- Added qualitative results illustrating the limitations.
Assigned Action Editor: ~Jianbo_Jiao2
Submission Number: 7925