<!DOCTYPE html>
<html>
<head>
  <!-- Google tag (gtag.js) -->
  <script async src="https://www.googletagmanager.com/gtag/js?id=G-W3K3MQ8SK4"></script>
  <script>
    window.dataLayer = window.dataLayer || [];
    function gtag(){dataLayer.push(arguments);}
    gtag('js', new Date());

    gtag('config', 'G-W3K3MQ8SK4');
  </script>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <meta property="og:type" content="website" />
  <meta property="og:site_name" content="Genova" />
  <meta property="og:url" content="https://token-verse.github.io/" />
  <meta property="og:title" content="Genova" />
  <meta property="og:description" content="Towards Subject-Consistent and Text-Aligned Personalized Image Generation via Precise Attribute Learning." />
  <meta property="og:image" content="https://token-verse.github.io/static/images/teaser.jpg" />
  <title>Genova</title>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
        rel="stylesheet">

  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@4.3.1/dist/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
  <script src="https://cdn.jsdelivr.net/npm/popper.js@1.14.7/dist/umd/popper.min.js" integrity="sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1" crossorigin="anonymous"></script>
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/bootstrap@4.3.1/dist/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous"></script>

  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="static/css/index.css">
  <link rel="icon" href="./static/favicon.ico">

  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>
  <!-- MathJax -->
  <script id="MathJax-script" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>


</head>
<body>


<section class="hero">
  <div class="hero-body">
    <div class="container is-max-desktop">
      <div class="columns is-centered">
        <div class="column has-text-centered">
            <h1 class="title is-2 publication-title">Towards<span style="
              font-weight: bold;
              -webkit-background-clip: text;
              color: #DD2476;
            ">
			  Subject-Consistent and Text-Aligned
			</span>Personalized Image Generation <br>via Precise Attribute Learning</h1>
        </div>
      </div>
    </div>
  </div>
</section>


<section class="section hero is-light is-small">
    <div class="container is-max-desktop">
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">📝 Abstract</h2>
        <div class="content has-text-justified">
          <p>
			Personalized image generation with Diffusion Transformers (DiTs) has made remarkable progress. 
			However, existing approaches face a trade-off between text alignment and fidelity to the reference subject. 
			This trade-off arises mainly because directly injecting subject tokens can disrupt the sampling trajectory of the base model, while methods based on textual inversion struggle to capture the subject's detailed attributes. 
			To address these limitations, we introduce <b>Genova</b>, a DiT-based subject-driven generation framework with a novel attribute learning module. 
			This module uses subject image tokens to improve the text-stream modulation, yielding a more distinct representation of the subject's visual attributes. 
			Unlike conventional modulation techniques in DiTs, our framework leverages hierarchical features from the subject image tokens, enabling more effective attribute learning. 
			This gives the model a precise semantic understanding of the subject, making better use of its inherent text-alignment capability and enabling more flexible and controllable image generation. 
			Moreover, we build <b>CoupleX</b>, a synthetic dataset of subject-paired samples that depict activities and interactions in natural scenes, providing richer context than previous datasets. 
			Extensive experiments demonstrate that our method outperforms current state-of-the-art methods and achieves personalized image generation that is consistent with both the subject and the prompt.
          </p>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section hero is-small is-centered" style="align-content: center;">
  <div class="container is-max-desktop">
    <div class="columns is-centered has-text-centered">
    <div class="column is-four-fifths">
      <h2 class="title is-3">🔍 Analysis</h2>
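      <img loading="lazy" src="static/images/motivation.png" alt="Comparison of Genova with token-injection and text-embedding approaches to subject-driven generation">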
      <div class="content has-text-justified is-centered">
        <br />
        <p>
        Comparison of our method, Genova, with two types of existing subject-driven generation methods.
        </p>
        <p>
        (a) Methods of the first type achieve subject-driven generation via token injection. They depend heavily on the subject image and struggle with text alignment. 
        </p>
        <p>
        (b) Methods of the second type achieve subject-driven generation via specialized text embeddings. They struggle to maintain subject consistency. 
        </p>
        <p>
        (c) In contrast, our method achieves both subject consistency and text alignment through hierarchical attribute learning for enhanced modulation.
        </p>
      </div>
    </div>
  </div>
</div>
</section>


<section class="section hero is-small is-centered" style="align-content: center;">
  <div class="container is-max-desktop">
    <div class="columns is-centered has-text-centered">
    <div class="column is-four-fifths">
      <h2 class="title is-3">🧪 Method</h2>
      <img loading="lazy" src="static/images/pipeline.png" alt="Overview of the Genova framework and details of the attribute learning module">
      <div class="content has-text-justified is-centered">
        <br />
        <p>
        (a) Overview of the proposed Genova framework. Text tokens, noisy image tokens, and subject image tokens are fed into the DiT model. Each DiT block contains the proposed attribute learning module, followed by the modulation mechanism. 
		The resulting modulation offsets \(\Delta_{attribute}\), which encode the specific subject attributes, are then applied to strengthen the text's semantic control in MM-attention. 
        </p>
        <p>
        (b) Details of the attribute learning module. The module processes subject image tokens (from block \(i-1\)) and attribute tokens through subject-driven self-attention, 
		which enhances the semantic understanding of the attribute tokens by incorporating hierarchical texture features from the subject image.
        </p>
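        <p>
        As a rough illustration of the modulation step in (a), and assuming the adaptive layer-norm modulation commonly used in DiT blocks (the exact formulation in Genova may differ), the effect of the offset on a text token \(x\) could be sketched as
        \[
        \tilde{x} = \bigl(1 + \gamma(y + \Delta_{attribute})\bigr) \odot \mathrm{LN}(x) + \beta(y + \Delta_{attribute}),
        \]
        where \(y\) is the conditioning vector, \(\gamma\) and \(\beta\) are learned scale and shift projections, and \(\mathrm{LN}\) is layer normalization. The symbols \(x\), \(y\), \(\gamma\), and \(\beta\), and the choice to add the offset to \(y\), are assumptions introduced here for illustration and are not taken from the figure.
        </p>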
      </div>
    </div>
  </div>
</div>
</section>


<section class="section carousel-sec">
  <div class="container is-max-desktop is-centered has-text-centered">
    <h2 class="title is-3">✨ Single-subject Personalization</h2>
    <div class="container is-max-desktop">
      <div class="hero-body">
        <div id="results-carousel" class="carousel results-carousel" data-slides-to-show="1">
          <img loading="lazy" src="results/single_subjects/r1.png" style="width: 100%;">
          <img loading="lazy" src="results/single_subjects/r2.png" style="width: 100%;">
          <img loading="lazy" src="results/single_subjects/r3.png" style="width: 100%;">
          <img loading="lazy" src="results/single_subjects/r4.png" style="width: 100%;">
          <img loading="lazy" src="results/single_subjects/r5.png" style="width: 100%;">
          <img loading="lazy" src="results/single_subjects/r6.png" style="width: 100%;">
          <img loading="lazy" src="results/single_subjects/r7.png" style="width: 100%;">
        </div>
      </div>
    </div> 
  </div>
</section>

<section class="section carousel-sec">
  <div class="container is-max-desktop is-centered has-text-centered">
    <h2 class="title is-3">✨ Multiple-subject Personalization</h2>
    <div class="container is-max-desktop">
      <div class="hero-body">
        <div id="results-carousel" class="carousel results-carousel" data-slides-to-show="1">
          <img loading="lazy" src="results/multi_subjects/m4.png" style="width: 100%;">
          <img loading="lazy" src="results/multi_subjects/m1.png" style="width: 100%;">
          <img loading="lazy" src="results/multi_subjects/m2.png" style="width: 100%;">
          <img loading="lazy" src="results/multi_subjects/m3.png" style="width: 100%;">
        </div>
      </div>
    </div> 
  </div>
</section>


</body>
</html>
