<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <title>3D-aware Disentangled Representation for Compositional Reinforcement Learning</title>

    <!-- Google Fonts -->
    <link rel="preconnect" href="https://fonts.googleapis.com" />
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
    <link
      href="https://fonts.googleapis.com/css2?family=Encode+Sans:wght@300;400;500;600&family=Roboto+Mono&display=swap"
      rel="stylesheet"
    />

    <!-- Bulma -->
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.4/css/bulma.min.css" />
    <link rel="stylesheet" href="./css/styles.css" />
    <script src="./js/bulma_toggle.js"></script>

    <!-- Font Awesome -->
    <script src="https://kit.fontawesome.com/5fd1dd8417.js" crossorigin="anonymous"></script>

    <!-- Academicons -->
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css" />
  </head>

  <body>
    <!-- title -->
    <section class="hero">
      <div class="hero-body">
        <div class="container is-max-widescreen has-text-centered">
          <!-- title -->
          <h1 class="title is-size-1 is-size-2-mobile publication-title">
            3D-aware Disentangled Representation <br /> for Compositional Reinforcement Learning
          </h1>

          <!-- authors -->
          <div class="container is-max-desktop has-text-centered">
            <div class="columns is-mobile is-centered is-gapless">
              <div class="column is-5-tablet is-size-5-tablet publication-authors">
                Anonymous authors
              </div>
            </div>
          </div>

          <div class="is-size-5-tablet publication-institute">
            <span class="author-block">Under review as a conference paper at ICLR 2026 <br> Submission number: 16912</span>
          </div>
        </div>
      </div>
    </section>

    <!-- content section with left-side subtitle, which is aligned to center -->
    <section class="section">
      <div class="container is-max-desktop">
        <!-- abstract -->
        <div class="column is-full-width is-centered has-text-centered has-text-left-mobile">
          <h2 class="subtitle is-size-3 has-text-weight-medium publication-keywords">
            Abstract
          </h2>
          <div class="column has-text-justified">
            <p class="content">
              Vision-based reinforcement learning can benefit from object-centric scene representation, which factorizes the visual observation into individual objects and their attributes, such as color, shape, size, and position.
              While such object-centric representations can extract components that generalize well across multi-object manipulation tasks, they are prone to occlusion and to 3D ambiguity in object properties because they rely on single-view 2D image features.
              Furthermore, the entanglement between object configurations and camera poses complicates the object-centric disentanglement in 3D, leading to poor 3D reasoning by the agent in vision-based reinforcement learning applications.
              <br /> 
              To address the lack of 3D awareness and the object-camera entanglement problem, we propose an enhanced 3D object-centric representation that utilizes multi-view 3D features and enforces more explicit 3D-aware disentanglement.
              The enhancement builds on recent advances in multi-view Transformers and on a concept memory shared among the object-centric representations.
              The representation can therefore stably identify and track the 3D trajectories of individual objects along with their semantic and physical properties, exhibiting excellent interpretability and controllability.
              Our proposed block transformer policy then performs novel tasks effectively by assembling the desired properties to match new goal states.
              We demonstrate that our 3D-aware block representation scales to composing diverse novel scenes and outperforms existing methods on out-of-distribution multi-object manipulation tasks.
            </p>
          </div>
        </div>
      </div>
    </section>

    <section class="section">
      <div class="container is-max-desktop">
        <div class="column is-full-width is-centered has-text-centered has-text-left-mobile">
          <h2 class="subtitle is-size-3 has-text-weight-medium publication-keywords">
              Method
            </h2>
          </div>
          <div class="column has-text-justified">
            <img src="./assets/model_structure.png" alt="Model structure" width="90%" style="display: block; margin: auto" />
            <p class="content">
              We propose a structured 3D object representation that learns disentangled object attributes (e.g., shape, color, size) through a novel 3D block-slot attention mechanism.
              Built on top of OSRT, this method enhances interpretability and disentanglement by assigning dedicated slots for the background and agent.
              The resulting representation captures compositional semantics critical for downstream tasks like robotic manipulation.
            </p>
          <div class="column has-text-justified">
            <img src="./assets/block_transformer.png" alt="Block transformer policy" width="90%" style="display: block; margin: auto" />
            <p class="content">
              To leverage the structured representation, we introduce a block transformer policy for goal-conditioned RL.
              It matches objects in the current and goal states based on their attributes and performs block-wise cross-attention to reason over their differences.
              A self-attention module then aggregates this information, together with agent features and actions, for action prediction.
              This approach enables learning a robust and generalizable policy that succeeds across diverse generalization scenarios, including compositional and out-of-distribution environments.
            </p>
          </div>
        </div>
      </div>
    </section>

    <section class="section">
      <div class="container is-max-desktop">
        <div class="column is-full-width is-centered has-text-centered has-text-left-mobile">
          <h2 class="subtitle is-size-3 has-text-weight-medium publication-keywords">
              Results: Novel View Synthesis
            </h2>
          </div>
          <div style="display: flex; flex-direction: column; gap: 1.5rem;">
            <div style="display: flex; align-items: center; gap: 1rem;">
              <div style="min-width: 60px; font-weight: bold;">Ours</div>
              <video src="./assets/clevr3d_nvs_ours.mp4" width="90%" style="display: block; margin: auto" controls></video>
            </div>
            <div style="display: flex; align-items: center; gap: 1rem;">
              <div style="min-width: 60px; font-weight: bold;">OSRT</div>
              <video src="./assets/clevr3d_nvs_osrt.mp4" width="90%" style="display: block; margin: auto" controls></video>
            </div>
            <div style="display: flex; align-items: center; gap: 1rem;">
              <div style="min-width: 60px; font-weight: bold;">Ours</div>
              <video src="./assets/isaacgym3d_nvs_ours.mp4" width="90%" style="display: block; margin: auto" controls></video>
            </div>
            <div style="display: flex; align-items: center; gap: 1rem;">
              <div style="min-width: 60px; font-weight: bold;">OSRT</div>
              <video src="./assets/isaacgym3d_nvs_osrt.mp4" width="90%" style="display: block; margin: auto" controls></video>
            </div>
          </div>
      </div>
    </section>

    <section class="section">
      <div class="container is-max-desktop">
        <div class="column is-full-width is-centered has-text-centered has-text-left-mobile">
          <h2 class="subtitle is-size-3 has-text-weight-medium publication-keywords">
              Results: Block Manipulation and Novel View Synthesis
            </h2>
          </div>
          <div class="column has-text-justified">
            <img src="./assets/block_manipulation.png" width="90%" style="display: block; margin: auto" />
          <div style="display: flex; flex-direction: column; gap: 1.5rem;">

            <div style="display: flex; align-items: center; gap: 1rem;">
              <div style="min-width: 60px; font-weight: bold;">Color swapping</div>
              <video src="./assets/clevr3d_bs_color.mp4" width="90%" style="display: block; margin: auto" controls></video>
            </div>
            <div style="display: flex; align-items: center; gap: 1rem;">
              <div style="min-width: 60px; font-weight: bold;">Size swapping</div>
              <video src="./assets/clevr3d_bs_size.mp4" width="90%" style="display: block; margin: auto" controls></video>
            </div>
            <div style="display: flex; align-items: center; gap: 1rem;">
              <div style="min-width: 60px; font-weight: bold;">Shape swapping</div>
              <video src="./assets/isaacgym3d_bs_shape.mp4" width="90%" style="display: block; margin: auto" controls></video>
            </div>
            <div style="display: flex; align-items: center; gap: 1rem;">
              <div style="min-width: 60px; font-weight: bold;">Position swapping</div>
              <video src="./assets/isaacgym3d_bs_position.mp4" width="90%" style="display: block; margin: auto" controls></video>
            </div>
          </div>
        </div>
      </div>
    </section>

    <section class="section">
      <div class="container is-max-desktop">
        <div class="column is-full-width is-centered has-text-centered has-text-left-mobile">
          <h2 class="subtitle is-size-3 has-text-weight-medium publication-keywords">
              Results: Generalization in Reinforcement Learning
          </h2>
        </div>
        <div style="display: flex; justify-content: space-around; gap: 3rem; margin-top: 2rem;">
          <div style="flex: 1; text-align: center;">
            <div style="font-weight: bold; margin-bottom: 1rem; font-size: 1.2rem; ">In-distribution</div>
            <div style="display: flex; justify-content: center; gap: 1.5rem">
              <div>
                <p style="font-weight: bold;">Goal image</p>
                <img src="./assets/id_goal_img.png" alt="Goal image" style="width: 90%; display: block; margin: auto;" />
              </div>
              <div>
                <p style="font-weight: bold;">Success episode</p>
                <img src="./assets/id_episode.gif" alt="Success episode" style="width: 90%; display: block; margin: auto;" />
              </div>
            </div>
          </div>
          <div style="flex: 1; text-align: center;">
            <div style="font-weight: bold; margin-bottom: 1rem; font-size: 1.2rem; ">Out-of-distribution</div>
            <div style="display: flex; justify-content: center; gap: 1.5rem">
              <div>
                <p style="font-weight: bold;">Goal image</p>
                <img src="./assets/ood_goal_img.png" alt="Goal image" style="width: 90%; display: block; margin: auto;" />
              </div>
              <div>
                <p style="font-weight: bold;">Success episode</p>
                <img src="./assets/ood_episode.gif" alt="Success episode" style="width: 90%; display: block; margin: auto;" />
              </div>
            </div>
          </div>
        </div>
      </div>
    </section>

    <section class="section">
      <div class="container is-max-desktop">
        <div style="display: flex; justify-content: space-around; gap: 3rem; margin-top: 2rem;">
          <div style="flex: 1; text-align: center;">
            <div style="font-weight: bold; margin-bottom: 1rem; font-size: 1.2rem; ">Compositional generalization</div>
            <div style="display: flex; justify-content: center; gap: 1.5rem">
              <div>
                <p style="font-weight: bold;">Goal image</p>
                <img src="./assets/cg_goal_img.png" alt="Goal image" style="width: 90%; display: block; margin: auto;" />
              </div>
              <div>
                <p style="font-weight: bold;">Success episode</p>
                <img src="./assets/cg_episode.gif" alt="Success episode" style="width: 90%; display: block; margin: auto;" />
              </div>
            </div>
          </div>
          <div style="flex: 1; text-align: center;">
            <div style="font-weight: bold; margin-bottom: 1rem; font-size: 1.2rem; ">Compositional generalization<br>(same color)</div>
            <div style="display: flex; justify-content: center; gap: 1.5rem">
              <div>
                <p style="font-weight: bold;">Goal image</p>
                <img src="./assets/cg_sc_goal_img.png" alt="Goal image" style="width: 90%; display: block; margin: auto;" />
              </div>
              <div>
                <p style="font-weight: bold;">Success episode</p>
                <img src="./assets/cg_sc_episode.gif" alt="Success episode" style="width: 90%; display: block; margin: auto;" />
              </div>
            </div>
          </div>
        </div>
      </div>
    </section>

    <!-- BibTeX section -->

    <!-- Footer section -->
    <footer>
      <!-- license -->
      <div class="content has-text-centered" style="margin-top: 1.6rem">
        <p>
          This website is licensed under a
          <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"
            >Creative Commons Attribution-ShareAlike 4.0 International License</a
          >.
        </p>
      </div>
    </footer>
  </body>
</html>
