<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <link rel="shortcut icon" href="./images/logo.ico" type="image/x-icon">
  <title>IR3D-Bench</title>
  <link href="https://fonts.googleapis.com/css2?family=Outfit:wght@300;400;600;700&display=swap" rel="stylesheet">


  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>

  <!-- custom additional scripts -->

  <link rel="stylesheet"
  href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
  <!-- https://docs.mathjax.org/en/latest/web/configuration.html#configuration-using-an-in-line-script -->
  <script>
    MathJax = {
      tex: {
        inlineMath: [['$', '$'], ['\\(', '\\)']]
      },
      svg: {
        fontCache: 'global'
      }
    };
  </script>
  <script type="text/javascript" id="MathJax-script" async
    src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js">
  </script>

  <script
  defer
  src="https://cdn.jsdelivr.net/npm/img-comparison-slider@8/dist/index.js"
  ></script>
  <link
    rel="stylesheet"
    href="https://cdn.jsdelivr.net/npm/img-comparison-slider@8/dist/styles.css"
  />

  <!-- swiper -->
  <!-- https://swiperjs.com/demos -->
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/swiper@10/swiper-bundle.min.css"/>
  <script src="https://cdn.jsdelivr.net/npm/swiper@10/swiper-bundle.min.js"></script>

  <style>
    :root {
      --bg: #f4f7fb;
      --card-bg-1: #ffffff;
      --card-bg-2: #f8faff;
      --border: #d0d7e4;
      --text-color: #1e293b;
      --subtext-color: #475569;
      --accent: #2563eb;
      --accent-light: #60a5fa;
    }

    * {
      box-sizing: border-box;
      margin: 0;
      padding: 0;
    }

    body {
      font-family: 'Outfit', sans-serif;
      background-color: var(--bg);
      color: var(--text-color);
      line-height: 1.7;
      background-image: radial-gradient(#e0e7ff 1px, transparent 1px);
      background-size: 30px 30px;
      font-weight: 400;
    }

    header {
      text-align: center;
      padding: 4rem 2rem 2rem;
      position: relative;
    }

    header h1 {
      font-size: 3.2rem;
      font-weight: 400;
      background: linear-gradient(to right, var(--accent), var(--accent-light));
      -webkit-background-clip: text;
      -webkit-text-fill-color: transparent;
      margin-bottom: 1rem;
      letter-spacing: -0.02em;
      line-height: 1.2;
    }

    header p {
      font-size: 1.15rem;
      color: var(--subtext-color);
      max-width: 720px;
      margin: 0 auto;
      font-weight: 300;
    }

    .container {
      max-width: 1000px;
      margin: 2rem auto;
      padding: 0 1.5rem;
    }

    section {
      background: linear-gradient(135deg, var(--card-bg-1), var(--card-bg-2));
      border: 1px solid var(--border);
      border-radius: 16px;
      padding: 2rem;
      margin-bottom: 2rem;
      box-shadow: 0 8px 30px rgba(0, 0, 0, 0.04);
      transition: transform 0.3s ease, box-shadow 0.3s ease;
      animation: fadeIn 0.6s ease;
      position: relative;
      overflow: hidden;
    }

    section::before {
      content: '';
      position: absolute;
      top: 0;
      left: 0;
      width: 100%;
      height: 5px;
      background: linear-gradient(to right, var(--accent), var(--accent-light));
    }

    section::after {
      content: '';
      position: absolute;
      bottom: 0;
      right: 0;
      width: 100px;
      height: 100px;
      background: radial-gradient(circle at bottom right, rgba(96, 165, 250, 0.1), transparent 70%);
      z-index: 0;
      border-radius: 50%;
    }

    section:hover {
      transform: translateY(-3px);
      box-shadow: 0 12px 40px rgba(0, 0, 0, 0.08);
    }

    section h2 {
      font-size: 1.6rem;
      font-weight: 600;
      color: var(--accent);
      margin-bottom: 1rem;
      position: relative;
      z-index: 1;
    }

    section p, section ul {
      font-size: 1.05rem;
      color: var(--subtext-color);
      margin-bottom: 1rem;
      position: relative;
      z-index: 1;
      font-weight: 300;
      letter-spacing: -0.01em;
    }

    ul {
      padding-left: 1.5rem;
    }

    li {
      margin-bottom: 0.5rem;
      font-weight: 300;
    }

    .placeholder {
      width: 100%;
      height: 200px;
      background: linear-gradient(45deg, #e2e8f0 25%, #f1f5f9 25%, #f1f5f9 50%, #e2e8f0 50%, #e2e8f0 75%, #f1f5f9 75%);
      background-size: 20px 20px;
      border: 2px dashed #cbd5e1;
      border-radius: 12px;
      color: #64748b;
      font-size: 0.95rem;
      font-weight: 400;
      display: flex;
      align-items: center;
      justify-content: center;
      margin-bottom: 1.5rem;
      transition: all 0.3s ease;
    }

    .placeholder:hover {
      background-size: 25px 25px;
      transform: scale(1.01);
    }

    .image-container {
      width: 100%;
      margin-bottom: 1.5rem;
      border-radius: 12px;
      overflow: hidden;
      box-shadow: 0 4px 20px rgba(0, 0, 0, 0.08);
      transition: transform 0.3s ease;
    }

    .image-container:hover {
      transform: scale(1.01);
    }

    .full-width-image {
      width: 100%;
      display: block;
      border-radius: 12px;
      margin-bottom: 2rem;
    }

    .pdf-object {
      width: 100%;
      height: 500px;
      border-radius: 12px;
    }

    pre {
      background: #f8faff;
      padding: 1.2rem;
      border-radius: 12px;
      overflow-x: auto;
      font-size: 0.9rem;
      color: #334155;
      border-left: 4px solid var(--accent-light);
      font-family: monospace;
      font-weight: 400;
    }

    footer {
      text-align: center;
      padding: 2rem;
      font-size: 0.9rem;
      color: #94a3b8;
      font-weight: 300;
    }

    @keyframes fadeIn {
      from {
        opacity: 0;
        transform: translateY(20px);
      }
      to {
        opacity: 1;
        transform: translateY(0);
      }
    }

    @media (max-width: 600px) {
      header h1 {
        font-size: 2.2rem;
      }

      section {
        padding: 1.5rem;
      }
    }

    .pipeline-list {
      margin-left: 1.5rem;
      margin-bottom: 1.5rem;
    }
    
    .pipeline-list li {
      margin-bottom: 1rem;
      font-weight: 300;
    }
    
    .pipeline-list ul {
      margin-top: 0.5rem;
      margin-bottom: 0.5rem;
    }
    
    .pipeline-list strong {
      font-weight: 500;
      color: var(--accent);
    }

    .first { background: #ff9800; color: #fff; font-weight: bold; }
    .second { background: #ffc04d; }
    .third { background: #ffe0b2; }
    .sectionbluebg { background: #f5faff; font-weight: bold; }
    .center { text-align: center; }
    .fail { color: red; font-size: 1.5em; }
    .small { font-size: 0.85em; }

    body, .container, section, .column, .publication-links {
      text-align: left !important;
    }
    .publication-links {
      text-align: center !important;
    }
    h3 {
      font-size: 1.35rem;
      font-weight: 700;
    }
    .section-underline {
      width: 100%;
      height: 2px;
      background: #e5e7eb;
      margin: 0.5rem 0 1.2rem 0;
      border-radius: 1px;
    }
  </style>
</head>
<body>

  <header>
    <div class="header-flex" style="display: flex; align-items: center; justify-content: center; gap: 1.2rem;">
      
      <h1 style="font-weight: 400; line-height: 1.2; margin: 0;">
        
        <t1>
          <span style="display: inline-flex; align-items: center; gap: 0.3rem;">
            <img src="./images/logo.png" alt="IR3D-Bench Logo" style="height:56px; width:auto; display:block;">
            <span style="background: linear-gradient(to right, #0a2463, #93c5fd); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">IR3D-Bench</span>: Evaluating Vision-Language Model 
          </span>

          <br>Scene Understanding as Agentic Inverse Rendering
        </t1>
      </h1>
    </div>
    <p><em style="font-size: 1.5rem;">What I cannot create, I do not understand. ---Richard Feynman</em></p>
    <p style="font-size: 2rem; color: #000000; margin-top: 1rem; font-weight: 400;">Anonymous Author(s)</p>
  </header>

  <div class="column has-text-centered">
    <div class="publication-links">
      <!-- PDF Link. -->
      <span class="link-block">
        <a href=""
           class="external-link button is-normal is-rounded is-dark">
          <span class="icon">
              <i class="fas fa-file-pdf"></i>
          </span>
          <span>Paper</span>
        </a>
      </span>
      <!-- Dataset Link. -->
      <span class="link-block">
        <a href="https://huggingface.co/datasets/Piang/IR3D-bench"
           class="external-link button is-normal is-rounded is-dark">
          <span class="icon">
              <i class="fa fa-database"></i>
          </span>
          <span>Dataset</span>
        </a>
      </span>
      <!-- Code Link. -->
      <span class="link-block">
        <a href="https://anonymous.4open.science/r/IR3D-bench-8EB2"
           class="external-link button is-normal is-rounded is-dark">
          <span class="icon">
              <i class="fab fa-github"></i>
          </span>
          <span>Code</span>
          </a>
      </span>
    </div>
  </div>


  <div class="container">
    <section>
        <h2>Motivation</h2>
        <div class="image-container">
            <img src="./images/teaser.png" alt="IR3D-Bench teaser figure" class="full-width-image" style="margin-bottom: 0;">
        </div>
        <p>
            Humans demonstrate true understanding through creation: we can recreate the scenes we observe because we genuinely comprehend their spatial relationships and physical attributes. 
            In contrast, current Vision-Language Agents (VLAs) are primarily evaluated on recognition tasks such as captioning or question answering (QA), which fail to assess deeper understanding. 
            <strong>Can VLAs truly understand what they see?</strong> 
            <span style="background: linear-gradient(to right, #0a2463, #39a0ed); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">IR3D-Bench</span> tests this by having them recreate what they observe.
        </p>
    </section>


    <section>
      <h2>Abstract</h2>
      <p>
        Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain.
We introduce <span style="background: linear-gradient(to right, #0a2463, #39a0ed); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">IR3D-Bench</span>, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition.
Grounded in the analysis-by-synthesis paradigm, <span style="background: linear-gradient(to right, #0a2463, #39a0ed); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">IR3D-Bench</span> tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use.
This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks.
We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility.
Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage.
<span style="background: linear-gradient(to right, #0a2463, #39a0ed); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">IR3D-Bench</span>, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.
      </p>
    </section>

    <section>
      <h2>Pipeline Overview</h2>

      <div class="image-container">
        <img src="./images/pipeline.png" alt="IR3D-Bench Pipeline Diagram" class="full-width-image" style="margin-bottom: 0;">
      </div>
      <p>
        Overview of the <span style="background: linear-gradient(to right, #0a2463, #39a0ed); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">IR3D-Bench</span> Pipeline:
      </p>
      <ol class="pipeline-list">
        <li><strong>Stage 1: Inverse Rendering</strong>: Given a raw image and camera parameters, the agent is prompted to infer a structured scene representation in JSON format. The predicted objects are rendered in Blender and matched to ground-truth (GT) annotations using geometric alignment and per-object mask comparisons.</li>
        <li><strong>Stage 2: Benchmark Evaluation</strong>: The following metrics provide a comprehensive view of the VLA's internal world model and generative precision:
          <ul>
            <li><strong>Localization</strong>: Object count, spatial alignment, and relation consistency</li>
            <li><strong>Visual Appearance</strong>: Shape and material accuracy via mask- and attribute-level scores</li>
            <li><strong>Language-Aligned Semantics</strong>: Layout fidelity and object plausibility assessed via GPT-4o</li>
          </ul>
        </li>
      </ol>
    </section>


    <section>
      <h2>Evaluation Results</h2>
      <div class="section-underline"></div>
      <h3>Holistic Comparison over the Metrics Suite</h3>
      <div class="image-container" style="position: relative; width: 50%; margin: 0.2rem auto 0.5rem auto;">
        <img src="./images/radar.png" alt="Radar chart comparing models across the IR3D-Bench metrics suite" class="full-width-image" style="margin-bottom: 0;">
      </div>

      <h3>Visual Results</h3>
      <div class="image-container" style="position: relative; width: 80%; margin: 0.2rem auto 0.5rem auto;">
        <img src="./images/main_results.png" alt="IR3D-Bench visual results" class="full-width-image" style="margin-bottom: 0;">
      </div>
        <h4>Conclusion:</h4>
        <ul>
          <li>Gemini-2.5-pro demonstrates strong understanding of object spatial positions and relative layouts.</li>
          <li>Grok-3 excels at modeling fine-grained details such as material and color.</li>
          <li>Qwen2.5-VL-72B struggles in more complex scenarios.</li>
        </ul>

      <div class="section-underline"></div>
      <h3>Iterative Refinements</h3>
      <div class="image-container" style="position: relative; width: 80%; margin: 0.2rem auto 0.5rem auto;">
        <img src="./images/refine_re.png" alt="IR3D-Bench iterative refinement results" class="full-width-image" style="margin-bottom: 0;">
      </div>
      <div class="image-container" style="position: relative; width: 80%; margin: 0.2rem auto 0.5rem auto;">
        <img src="./images/refine_with_grid.png" alt="IR3D-Bench iterative refinement results with grid" class="full-width-image" style="margin-bottom: 0;">
      </div>

      <h4>Conclusion:</h4>
      <p>
        As the number of refinement rounds increases, GPT-4o's results on the cases it initially handled poorly gradually improve, even outperforming Gemini-2.5-pro.
      </p>

      
    </section>

    <section>
      <h2>BibTeX</h2> 
      <p style="margin-bottom: 1.5rem;">If you find our work useful, please consider citing our paper:</p>
      
<pre>@inproceedings{ir3dbench2025,
  title={IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering},
  author={Anonymous Authors},
  booktitle={NeurIPS 2025 submission},
  year={2025}
}</pre>
    </section>

  </div>

  <footer>
    &copy; 2025 IR3D-Bench Research Team · All Rights Reserved. Source code borrowed from <a href="https://github.com/nerfies/nerfies">Nerfies</a>.
  </footer>

</body>
</html>