
<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta http-equiv="x-ua-compatible" content="ie=edge">

    <title>RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion</title>

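    <meta name="description" content="RealmDreamer generates general forward-facing 3D scenes from text descriptions by optimizing a 3D Gaussian Splatting representation with inpainting and depth diffusion priors.">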
    <meta name="viewport" content="width=device-width, initial-scale=1">

    <meta property="og:image:type" content="image/png">
    <meta property="og:image:width" content="1711">
    <meta property="og:image:height" content="576">
    <meta property="og:type" content="website" />
    <meta property="og:title" content="RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion" />
    <meta property="og:description" content="We introduce RealmDreamer, a technique for generation of general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by utilizing the state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model by conditioning on the samples from the inpainting model, giving rich geometric structure. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require training on any scene-specific dataset and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image." />

    <meta name="twitter:card" content="summary_large_image" />
    <meta name="twitter:title" content="RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion" />
    <meta name="twitter:description" content="We introduce RealmDreamer, a technique for generation of general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by utilizing the state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model by conditioning on the samples from the inpainting model, giving rich geometric structure. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require training on any scene-specific dataset and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image." />


<link rel="icon" href="data:image/svg+xml,<svg xmlns=%22http://www.w3.org/2000/svg%22 viewBox=%220 0 100 100%22><text y=%22.9em%22 font-size=%2290%22>🏛️</text></svg>">

<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-QWTKZyjpPEjISv5WaRU9OFeRpok6YctnYmDr5pNlyT2bRjXh0JMhjY6hW+ALEwIH" crossorigin="anonymous">
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.4.0/css/font-awesome.min.css">
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.8.0/codemirror.min.css">
    <link rel="stylesheet" href="css/app.css">
	<link rel="stylesheet" href="css/fontawesome.all.min.css">
	<link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">


	<!-- Third-party libraries -->
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/js/bootstrap.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.8.0/codemirror.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/1.5.3/clipboard.min.js"></script>
    <script type="text/javascript" async
        src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML">
    </script>
	<script defer src="js/fontawesome.all.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.5.0/Chart.min.js"></script>

    <script src="js/app.js"></script>
    <script src="js/synced_video_selector.js"></script>

</head>

<body style="padding: 0%; width: 100%">
    <div class="container-fluid bg-white text-black py-3">
        <h1 class="text-center">RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion</h1>
    </div>

    <div class="desktop-content">
    <!-- Table of Contents Section -->
    <div class="container my-5">
        <h2 class="text-center mb-4">Table of Contents</h2>
        <ul class="list-group">
            <li class="list-group-item"><a href="#main">Teaser</a>: Teaser video showing our method's application.</li>
            <li class="list-group-item"><a href="#results">Results</a>: Renderings and depth from all scenes created by our method.</li>
            <li class="list-group-item"><a href="#comparisons">Comparisons</a>: Side-by-side comparisons with state-of-the-art baselines.</li>
            <li class="list-group-item"><a href="#single">Image to 3D</a>: Applying our method to single images in the wild.</li>
        </ul>
    </div>
        <div class="container-fluid px-5" id="main">

            <div class="row py-5 align-items-center justify-content-center" id="abstract">
                <div class="col-md-5">
                    <h1 class="text-center pb-5 text-bold">
                        <b>Abstract</b>
                    </h1>
                    <p class="text-justify px-5">
                        We introduce <b>RealmDreamer</b>, a technique for generating general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by using state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model conditioned on the samples from the inpainting model, which yields rich geometry. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require training on any scene-specific dataset and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.
                    </p>
                </div>

                <div class="col-md-7">
                    <video class="img-fluid" loop autoplay muted>
                        <source src="videos/teaser/teaser_twitter.mp4" />
                    </video>
                </div>
            </div>
            <br>
        </div>  
        
        <!-- <div class="container-fluid" id="main">

            <div class="row py-5 align-items-center justify-content-center" id="abstract">
                <div class="col-12 text-center">
                    <video id="myVideo" class="img-fluid w-50 w-md-75 mx-auto" loop autoplay muted playsinline poster="videos/teaser/poster.png" controls alt="RealmDreamer Teaser">
                        <source src="videos/teaser/teaser_twitter.mp4" />
                    </video>
                </div>
            </div>
            <br>
        </div>   -->
    </div>

    <div class="mobile-content">
        <div class="container" id="main">

            <div class="row py-5 align-items-center justify-content-center" id="abstract">
                <div class="col-md-6">
                    <h1 class="text-center pb-5 text-bold">
                        <b>Abstract</b>
                    </h1>
                    <p class="text-justify">
                        We introduce <b>RealmDreamer</b>, a technique for generating general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by using state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model conditioned on the samples from the inpainting model, which yields rich geometry. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require training on any scene-specific dataset and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.
                    </p>
                </div>
                <!-- <div class="col--md-6">
                    <video class="img-fluid" loop autoplay muted>
                        <source src="videos/teaser/teaser_twitter.mp4" />
                    </video>
                </div> -->
            </div>
            <br>
        </div>    
    </div>

    <div class="container-fluid" id="results">
        <div class="row py-5 mt-5 bg-dark">
            <div class="col-2"></div>
            <div class="col-md-8">
                <h1 class="text-center pb-2 text-white">
				  <b>Results</b>
                </h1><br>

                <script>
                    activeMethodPill = "zipnerf"
                    activeScenePill = document.querySelector('.scene-pill.active-pill');
                    activeModePill = document.querySelector('.mode-pill.active-pill');
                </script>
                
                <div class="text-center">
                    <div class="video-container">
                        <video class="video" style="height: 480px; max-width: 100%;" id="compVideo0" loop playsinline autoplay muted>
                            <source src="videos/results/rgb/bear.mp4" />
                        </video>
                        <video class="video" style="height: 480px; max-width: 100%;" id="compVideo1" loop playsinline autoplay muted hidden>
                            <source src="videos/results/depth/bear.mp4" />
                        </video>
                    </div>
                    <div class="text-center" style="color: black;" id="mode-pills">
                        <div class="btn-group btn-group-sm">
                            <span class="btn btn-primary mode-pill active" data-value="rgb"
                                onclick="selectCompVideo(activeMethodPill, activeScenePill, null, this)">
                                RGB
                            </span>
                            <span class="btn btn-primary mode-pill" data-value="depth"
                                onclick="selectCompVideo(activeMethodPill, activeScenePill, null, this)">
                                Depth
                            </span>
                        </div>
                    </div>


                    <br>
                    <p class="text-justify text-white" style="text-align: center;" id="prompt-box">A bear sitting in a classroom with a hat on, realistic, 4k image, high detail</p>
                    <script>
                        video0 = document.getElementById("compVideo0");
                        video1 = document.getElementById("compVideo1");

                        video0.addEventListener('loadedmetadata', function() {
                            if (activeVidID == 0 && select){
                                video0.play();
                                // print video size
                                console.log(video0.videoWidth, video0.videoHeight);
                                video0.hidden = false;
                                video1.hidden = true;
                            }
                        });

                        video1.addEventListener('loadedmetadata', function() {
                            if (activeVidID == 1 && select){
                                video1.play();
                                // print video size
                                console.log(video1.videoWidth, video1.videoHeight);
                                video0.hidden = true;
                                video1.hidden = false;
                            }
                        });
                    </script>

                    <div class="pill-row scene-pills" id="scene-pills">
                        <span class="pill scene-pill active" data-value="bear" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/bear.png" alt="bear" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="bedroom3" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/bedroom3.png" alt="bedroom" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="bust" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/bust.png" alt="bust" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="boat" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/boat.jpg" alt="boat" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="lavender" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/lavender.png" alt="lavender" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="living_room" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/living_room.png" alt="living room" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="piano" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/piano.jpg" alt="piano" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="resolute" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/resolute.png" alt="resolute" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="astronaut2" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/astronaut2.jpg" alt="astronaut" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="car" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/car.png" alt="car" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="bathroom" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/bathroom2.png" alt="bathroom" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="surf" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/surf.png" alt="surf" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="victorian" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/victorian.jpg" alt="victorian" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="lighthouse" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/lighthouse.png" alt="lighthouse" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="forest" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/forest.jpg" alt="forest" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="kitchen" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/kitchen.png" alt="kitchen" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="steampunk" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/steampunk.jpg" alt="steampunk" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="arcade" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/arcade.jpg" alt="arcade" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="japan" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/japan.png" alt="japan" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="bohemian" onclick="selectCompVideo(activeMethodPill, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/bohemian.png" alt="bohemian" width="64">
                        </span>
                    </div>

                    <script>
                        activeMethodPill = document.querySelector('.method-pill.active-pill');
                        activeScenePill = document.querySelector('.scene-pill.active-pill');
                        activeModePill = document.querySelector('.mode-pill.active-pill');
                    </script>
                </div>

            </div>
            <div class="col-2"></div>
        </div>
    </div>


    <div class="container-fluid" id="comparisons">
        <div class="row py-5 mt-5 bg-dark">
            <div class="col-2"></div>
            <div class="col-md-8">
                <h1 class="text-center pb-2 text-white">
                    <b>Comparisons</b>
                </h1><br>
    
                <script>
                    activeMethodPill2 = "zipnerf"
                    activeScenePill2 = document.querySelector('.scene-pill.active-pill');
                    activeModePill2 = document.querySelector('.mode-pill.active-pill');
                </script>
    
                <div class="text-center">
                    <div class="video-container">
                        <video class="video" style="height: 480px; max-width: 100%;" id="compVideo20" loop playsinline autoplay muted>
                            <source src="videos/comparison/bear.mp4" />
                        </video>
                        <video class="video" style="height: 480px; max-width: 100%;" id="compVideo21" loop playsinline autoplay muted hidden>
                            <source src="videos/results/depth/bear.mp4" />
                        </video>
                    </div>
    
                    <br>
                    <script>
                        video02 = document.getElementById("compVideo20");
                        video12 = document.getElementById("compVideo21");
    
                        video02.addEventListener('loadedmetadata', function() {
                            if (activeVidID2 == 0 && select2){
                                video02.play();
                                // print video size
                                console.log(video02.videoWidth, video02.videoHeight);
                                video02.hidden = false;
                                video12.hidden = true;
                            }
                        });
    
                        video12.addEventListener('loadedmetadata', function() {
                            if (activeVidID2 == 1 && select2){
                                video12.play();
                                // print video size
                                console.log(video12.videoWidth, video12.videoHeight);
                                video02.hidden = true;
                                video12.hidden = false;
                            }
                        });
                    </script>
    
                    <div class="pill-row scene-pills" id="scene-pills">
                        <span class="pill scene-pill active" data-value="bear" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/bear.png" alt="bear" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="bedroom3" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/bedroom3.png" alt="bedroom" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="bust" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/bust.png" alt="bust" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="boat" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/boat.jpg" alt="boat" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="lavender" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/lavender.png" alt="lavender" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="living_room" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/living_room.png" alt="living room" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="piano" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/piano.jpg" alt="piano" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="resolute" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/resolute.png" alt="resolute" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="astronaut2" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/astronaut2.jpg" alt="astronaut" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="car" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/car.png" alt="car" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="bathroom" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/bathroom2.png" alt="bathroom" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="surf" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/surf.png" alt="surf" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="victorian" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/victorian.jpg" alt="victorian" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="lighthouse" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/lighthouse.png" alt="lighthouse" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="forest" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/forest.jpg" alt="forest" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="kitchen" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/kitchen.png" alt="kitchen" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="steampunk" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/steampunk.jpg" alt="steampunk" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="arcade" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/arcade.jpg" alt="arcade" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="japan" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/japan.png" alt="japan" width="64">
                        </span>
                        <span class="pill scene-pill" data-value="bohemian" onclick="selectCompVideo2(activeMethodPill2, this, 3)">
                            <img class="thumbnail-img" src="thumbnails/bohemian.png" alt="bohemian" width="64">
                        </span>
                    </div>
    
                    <script>
                        activeMethodPill2 = document.querySelector('.method-pill.active-pill');
                        activeScenePill2 = document.querySelector('.scene-pill.active-pill');
                        activeModePill2 = document.querySelector('.mode-pill.active-pill');
                    </script>
                </div>
            </div>
            <div class="col-2"></div>
        </div>
    </div>
    

    <div class="container">
        <div class="row py-5 align-item-center">
            <div class="col-md-2"></div>
            <div class="col-md-8 text-center h-100">
                <h2 class="fst-italic">Inpainting priors are great for occlusion reasoning</h2>
                <p class="text-justify pt-4">
                Using text-conditioned 2D diffusion models for 3D scene generation is tricky because independent samples are not 3D-consistent with one another. We mitigate this by instead using 2D inpainting priors as novel-view estimators: we render an incomplete 3D model from a new viewpoint and inpaint the unknown regions, which lets us generate consistent 3D scenes.
                </p>
            </div>
            <div class="col-md-2"></div>
        </div>
    </div>

    <div class="container-fluid bg-secondary text-white" id="single">
        <div class="row py-5 align-item-center">
            <div class="col-md-2"></div>
            <div class="col-md-8 text-center h-100">
                <h2><b>Image to 3D</b></h2>
                <p class="text-justify pt-4">
                    Our technique can also generate 3D scenes from a single image. This is challenging because the model must hallucinate the geometry and texture that are missing from the input view. As in the text-driven setting, we do not require training on any scene-specific dataset.
                </p>
            </div>
            <div class="col-md-2"></div>
        </div>
        <div class="row py-2 align-item-center">
            <div class="col-md-2"></div>
            <div class="col-md-8 text-center">
                <div class="row">
                    <div class="col-4"> <!-- Adjust col-4 to your preference for smaller screens -->
                        <img class="img-fluid" src="thumbnails/gate.png" alt="Input image of the Brandenburg Gate" />
                        <p class="text-center">Input Image</p>
                    </div>
                    <div class="col-8"> <!-- Adjust col-6 to your preference for smaller screens -->
                        <video class="video img-fluid" loop autoplay muted>
                            <source src="videos/single/gate.mp4" />
                        </video>
                    </div>
                </div>
                <div class="row">
                    <span class="xkcd" style="font-size: 1.5em;">"The Brandenburg Gate in Berlin, large stone gateway with series of columns and a sculpture of a chariot and horses on top, clear sky, 4k image, photorealistic"</span>
                </div>
            </div>
            
            <div class="col-md-2"></div>
        </div>
        <div class="row py-5 align-item-center">
            <div class="col-md-2"></div>
            <div class="col-md-8 text-center">
                <div class="row">
                    <div class="col-4"> <!-- Adjust col-6 to your preference for smaller screens -->
                        <img class="img-fluid" src="thumbnails/conference.png" alt="Input image of a conference room" />
                        <p class="text-center">Input Image</p>
                    </div>
                    <div class="col-8"> <!-- Adjust col-6 to your preference for smaller screens -->
                        <video class="video img-fluid" loop autoplay muted>
                            <source src="videos/single/conference.mp4" />
                        </video>
                    </div>
                </div>
                <div class="row">
                    <span class="xkcd" style="font-size: 1.5em;">"A minimal conference room, with a long table, a screen on the wall and a whiteboard, 4k image, photorealistic, sharp"</span>
                </div>
            </div>
            
            <div class="col-md-2"></div>
        </div>
    </div>


</body>
</html>
