Keywords: 3D consistency, Diffusion, Reconstruction
Abstract: Recent text-to-image models such as GPT achieve photorealistic quality that rivals professional photography, yet our experimental analysis reveals critical geometric inconsistencies when these models are used for multi-view generation. The inconsistencies appear as specific rotational errors, such as facial expressions changing between views (an open mouth becoming closed) or object details disappearing during rotation (remote-control buttons missing in side views), together with systematic texture loss that compromises downstream 3D reconstruction quality. Existing methods address multi-view consistency through end-to-end generation with geometric constraints, but they face an inherent trade-off between visual fidelity and geometric coherence and often produce over-smoothed results that sacrifice the detail achievable by models like GPT. To exploit these 2D foundation models while resolving their geometric limitations, we introduce a two-stage pipeline that decouples view generation from geometric refinement. Our core contribution is MV-Diffus3R, a plug-and-play refinement module that takes high-quality but geometrically inconsistent multi-view images from GPT and produces geometrically coherent outputs suitable for high-quality 3D reconstruction. MV-Diffus3R uses Plücker ray embeddings for geometric conditioning and a dual-pathway attention mechanism that preserves fine visual details while enforcing cross-view geometric correspondence. In evaluations on GPT-generated multi-view sets, we demonstrate superior geometric fidelity compared to existing text-to-3D and multi-view generation methods, achieving a 33% improvement in FID while maintaining the visual quality that makes GPT outputs distinctive. Our approach bridges powerful but geometrically inconsistent 2D generators and the stringent geometric requirements of 3D content creation.
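For readers unfamiliar with Plücker ray embeddings, the following is a minimal sketch of the general technique, not the implementation used in MV-Diffus3R. It assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, a camera-to-world rotation `R` and camera center `t`, and the ordering (direction, moment); the function names and this ordering are illustrative assumptions, since the abstract does not specify them.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def plucker_ray(origin, direction):
    """Return the 6-D Plücker embedding (d, o x d) of a ray.

    d is the unit direction and o x d is the moment. The moment is the
    same for any origin chosen along the ray, so the embedding depends
    only on the line itself.
    """
    d = normalize(direction)
    m = cross(origin, d)
    return d + m  # tuple concatenation: (dx, dy, dz, mx, my, mz)

def pixel_ray_embedding(u, v, fx, fy, cx, cy, R, t):
    """Plücker embedding of the ray through continuous pixel (u, v).

    Assumed conventions (hypothetical, for illustration only):
    R is a 3x3 camera-to-world rotation given as rows, t is the camera
    center in world coordinates, and the camera looks along +z.
    """
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    d_world = tuple(sum(R[i][j] * d_cam[j] for j in range(3))
                    for i in range(3))
    return plucker_ray(t, d_world)
```

Evaluating `pixel_ray_embedding` at every pixel of a view yields a six-channel map that can be concatenated with image features as a per-pixel geometric condition; how MV-Diffus3R injects this signal is not described in the abstract.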
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19148