Keywords: 3D Scene Generation, Object Layout Optimization, Multi-View Consistency
Abstract: We introduce Geo-Refine, a single-image 3D scene generator that couples geometry–appearance preprocessing with a two-stage voxel–mesh localization pipeline to produce physically valid, visually complete multi-object scenes. Unlike prior methods that either overfit to image priors or rely on sequential post-hoc segmentation, Geo-Refine follows a unified, end-to-end formulation. Conditioned on a single RGB image, it first extracts clean object regions through high-precision masking, directional color-spill suppression, and multi-view appearance consistency, then jointly optimizes object placement and fine mesh alignment. The global layout is cast as an energy-guided voxel reasoning problem that enforces projection evidence, ground support, and semantic co-location, while a subsequent mesh-level refinement stage guarantees collision-free, contact-accurate geometry. Experiments on diverse indoor and outdoor benchmarks show consistent gains in CLIP, VQ, and GPT-4 metrics, along with sharper geometry, stable object interactions, and improved multi-view fidelity over state-of-the-art image-to-3D baselines. These results highlight the value of Geo-Refine for reliable single-image 3D scene synthesis and understanding.
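To make the abstract's "energy-guided voxel reasoning" concrete, below is a minimal sketch of how a weighted layout energy over candidate voxel placements could be scored and minimized; the term names, weights, grid conventions, and helper functions are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def layout_energy(occupancy, proj_evidence, ground_z, sem_affinity,
                  w_proj=1.0, w_support=0.5, w_sem=0.25):
    """Hypothetical energy for one candidate object placement on a voxel grid.

    occupancy     : (D, H, W) bool, voxels covered by the candidate placement
    proj_evidence : (D, H, W) float in [0, 1], agreement with image-space masks
    ground_z      : int, index of the supporting ground plane along axis 0
    sem_affinity  : float in [0, 1], semantic co-location score with neighbors
    """
    # Projection term: penalize occupied voxels that lack image evidence.
    e_proj = np.mean(1.0 - proj_evidence[occupancy]) if occupancy.any() else 1.0

    # Support term: penalize placements whose lowest voxels float above ground.
    lowest = occupancy.nonzero()[0].min() if occupancy.any() else ground_z
    e_support = abs(int(lowest) - ground_z)

    # Semantic term: reward plausible co-location (e.g. a monitor on a desk).
    e_sem = 1.0 - sem_affinity

    return w_proj * e_proj + w_support * e_support + w_sem * e_sem

def best_placement(candidates, proj_evidence, ground_z, affinities):
    """Pick the candidate placement with the lowest total layout energy."""
    energies = [layout_energy(occ, proj_evidence, ground_z, aff)
                for occ, aff in zip(candidates, affinities)]
    return int(np.argmin(energies))
```

In the paper's pipeline, the winner of this coarse voxel-level search would then be handed to the mesh-level refinement stage for collision and contact adjustment.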
Primary Area: generative models
Submission Number: 9524