RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion

Abstract

We introduce RealmDreamer, a technique for generating general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by utilizing state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model, conditioning it on the samples from the inpainting model. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require training on any scene-specific dataset and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.
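
As a rough illustration of the initialization described above, the sketch below samples a reference image with a text-to-image model, estimates monocular depth, and unprojects the pixels into a colored point cloud that could seed the Gaussian splats. This is a minimal sketch under assumed components: the checkpoints ("stabilityai/stable-diffusion-2-1" for image synthesis, "Intel/dpt-large" for depth) and the pinhole intrinsics are stand-ins, not the exact models or parameters used by RealmDreamer.

```python
# Illustrative sketch of the initialization stage only (not the released code).
# The checkpoints and camera intrinsics below are stand-in assumptions.
import numpy as np
import torch
from diffusers import StableDiffusionPipeline
from transformers import pipeline as hf_pipeline

prompt = "A bear sitting in a classroom with a hat on, realistic, 4k image, high detail"

# 1. Sample a reference view from a text-to-image generator.
t2i = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
image = t2i(prompt).images[0]  # PIL image

# 2. Estimate monocular depth for the sampled image. DPT predicts relative
#    inverse depth; we crudely convert it to a relative depth proxy here.
depth_estimator = hf_pipeline("depth-estimation", model="Intel/dpt-large")
disparity = np.asarray(depth_estimator(image)["depth"], dtype=np.float32)  # larger = closer
z = 1.0 / (disparity / 255.0 + 1e-2)  # relative depth, sufficient for illustration

# 3. Unproject every pixel through assumed pinhole intrinsics into a colored
#    point cloud. The occlusion volume behind these visible surfaces would then
#    be filled with additional points before fitting the 3D Gaussians.
h, w = z.shape
fx = fy = 0.8 * w                   # assumed focal length in pixels
cx, cy = w / 2.0, h / 2.0
u, v = np.meshgrid(np.arange(w), np.arange(h))
points = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=-1).reshape(-1, 3)
colors = np.asarray(image, dtype=np.float32).reshape(-1, 3) / 255.0
print(points.shape, colors.shape)  # (H*W, 3) each: seeds for the Gaussian splats
```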


Results


[Interactive results viewer: paired RGB and depth renderings for each scene. Example prompt: "A bear sitting in a classroom with a hat on, realistic, 4k image, high detail". Scenes: bear, bedroom, bust, boat, lavender, living room, piano, resolute, astronaut, car, bathroom, japan, bohemian.]

Comparisons



[Interactive comparison viewer over the same set of scenes as above.]

Inpainting priors are great for occlusion reasoning

Using text-conditioned 2D diffusion models for 3D scene generation is difficult because their samples are not 3D-consistent with one another. We mitigate this by instead leveraging 2D inpainting priors as novel-view estimators: by rendering the incomplete 3D model and inpainting only its unknown regions, we keep the known content fixed and generate consistent 3D scenes.
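
To make this concrete, the sketch below shows how one such inpainting step could look, using the Stable Diffusion inpainting pipeline from the diffusers library as a stand-in for the image-conditional inpainting model; the file names, prompt, and checkpoint are illustrative assumptions rather than the exact components used in RealmDreamer.

```python
# Illustrative inpainting step: fill the unknown regions of a rendered novel
# view so the result can supervise the 3D Gaussians from that viewpoint.
# Checkpoint and file names are stand-ins, not the paper's exact components.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

prompt = "A bear sitting in a classroom with a hat on, realistic, 4k image, high detail"

# A novel view rendered from the incomplete 3D model, plus a mask that is
# white wherever the splats do not cover the image (disoccluded / unknown pixels).
render = Image.open("render_novel_view.png").convert("RGB").resize((512, 512))
unknown_mask = Image.open("unknown_region_mask.png").convert("L").resize((512, 512))

# Only the masked pixels are re-synthesized, so the known content stays
# consistent with the current 3D scene while new content is hallucinated
# behind occlusions. The output serves as a target image for this viewpoint.
inpainted = pipe(prompt=prompt, image=render, mask_image=unknown_mask).images[0]
inpainted.save("inpainting_target.png")
```

Repeating this over many sampled viewpoints and optimizing the splats toward the resulting targets, together with depth supervision from the depth diffusion model, is what drives the scene toward 3D consistency.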

Image to 3D

We show that our technique can generate 3D scenes from a single image. This is a challenging task, as it requires the model to hallucinate the geometry and texture that are missing from the input view. As before, we do not require training on any scene-specific dataset.

Input Image

"The Brandenburg Gate in Berlin, large stone gateway with series of columns and a sculpture of a chariot and horses on stop, clear sky, 4k image, photorealistic"

Input Image

"A minimal conference room, with a long table, a screen on the wall and a whiteboard, 4k image, photorealistic, sharp"