PhyMix: Towards Physically Consistent Single-Image 3D Indoor Scene Generation with Implicit–Explicit Optimization
Keywords: 3D Scene Generation, Indoor Scene Synthesis, Physics-Aware Optimization
TL;DR: PhyMix couples a Physics Evaluator with Scene-GRPO and test-time optimization to produce physically consistent single-image 3D indoor scenes.
Abstract: Existing single-image 3D indoor scene generators often produce results that look visually plausible but fail to obey real-world physics, limiting their reliability in robotics, embodied AI, and design. To examine this gap, we introduce a unified Physics Evaluator that measures four main aspects: contact, stability, geometric priors, and deployability. These aspects are further decomposed into nine sub-constraints, establishing the first benchmark for measuring physical consistency. Our analysis with this evaluator shows that state-of-the-art methods remain largely physics-unaware. To overcome this limitation, we propose PhyMix, a framework that integrates feedback from the Physics Evaluator into both training and inference to enhance the physical plausibility of generated scenes. PhyMix is composed of two complementary components: (i) implicit alignment via Scene-GRPO, a critic-free group-relative policy optimization that uses the Physics Evaluator as a preference signal and biases sampling towards physically feasible layouts, and (ii) explicit refinement via a plug-and-play Test-Time Optimizer (TTO) that uses differentiable evaluator signals to correct residual violations during generation. Overall, our method unifies evaluation, reward shaping, and inference-time correction, producing 3D indoor scenes that are both visually faithful and physically plausible. Extensive evaluations on a synthetic dataset confirm state-of-the-art performance in both visual fidelity and physical plausibility, and qualitative examples on stylized and real-world images further showcase the method’s robustness.
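As a rough illustration of what "critic-free group-relative" optimization typically means, the sketch below normalizes each sampled scene's reward against the other samples drawn for the same input, so no learned value function is needed. This is a generic GRPO-style advantage computation, not the paper's implementation: the function name `group_relative_advantages`, the `eps` parameter, and the example scores are all hypothetical, and the abstract does not specify how the Physics Evaluator's outputs are combined into a scalar reward.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group (no learned critic).

    `rewards` holds one scalar per scene sampled from the same input image.
    Assumption: each scalar stands in for an aggregated physics score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # Samples above the group mean get positive advantage, below get negative.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical physics scores for four scenes sampled from one image.
scores = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(scores)
```

In a policy-gradient update, samples with positive advantage would be made more likely and those with negative advantage less likely, which is one way sampling could be biased towards physically feasible layouts.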
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 968