<!DOCTYPE html>
<html>

<head>
  <meta charset="utf-8">
  <meta name="description" content="Test-Time Training Done Right">
  <meta name="keywords" content="Test-Time Training, LaCT">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Test-Time Training Done Right</title>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">

  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">
  <!-- <link rel="icon" href="./static/images/favicon.svg"> -->

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>

  <script async src="https://unpkg.com/es-module-shims@1.8.0/dist/es-module-shims.js"></script>

  <script type="importmap">
    {
      "imports": {
        "three": "https://unpkg.com/three@0.156.1/build/three.module.js",
        "three/controls/OrbitControls": "https://unpkg.com/three@0.156.1/examples/jsm/controls/OrbitControls.js",
        "three/libs/stats": "https://unpkg.com/three@0.156.1/examples/jsm/libs/stats.module.js"
      }
    }
  </script>
  <script type="module" src="https://ajax.googleapis.com/ajax/libs/model-viewer/3.1.1/model-viewer.min.js"></script>

</head>

<body>


  <section class="hero">
    <div class="hero-body">
      <div class="container is-max-desktop">
        <div class="columns is-centered">
          <div class="column has-text-centered">
            <h1 class="title is-1 publication-title">
              <span class="dnerf">Test-Time Training Done Right</span>
            </h1>

            <div class="is-size-5 publication-authors">
              <span class="author-block">Anonymous Authors</span>
              <p><br></p>
              <span class="author-block">
                <strong>
                  Large Chunk Test-Time Training (LaCT) for efficient long-context sequence modeling.
                </strong>
              </span>

              <br />
              <br />
              <div class="columns is-1 is-multiline is-mobile has-text-justified">
                <img src="./static/images/lact_arch.png" alt="LaCT architecture">
                <p><strong>Figure 2</strong>.
                  Basic diagram of a LaCT block.
                  The large-chunk TTT layer updates the fast weight W to store information from previous chunks,
                  while the window attention handles the internal structure within each chunk.
                  The solid line denotes the information flow over model depth, and the dashed line denotes the information flow over time (i.e., the fast weight W passed from chunk to chunk).
                </p>
              </div>

            </div>
          </div>
        </div>
      </div>
    </div>
  </section>


  <section class="section">
    <div class="container is-max-desktop">
      <!-- Abstract. -->
      <div class="columns is-centered has-text-centered">
        <div class="column is-four-fifths">
          <h2 class="title is-3">Abstract</h2>
          <div class="content has-text-justified">
            <p>
              
              Test-Time Training (TTT) models context dependencies by adapting
              part of the model's weights (often referred to as fast weights) at
              inference time. This adapted fast weight, similar to recurrent
              states in RNNs, stores temporary memories of past tokens in the
              current sequence. Existing TTT methods have struggled to
              demonstrate effectiveness in handling long-sequence data, due to
              their computational inefficiency on modern GPUs. The TTT layers in
              many of these approaches operate with extremely low FLOPs
              utilization (often below 5%) because they deliberately apply
              small online mini-batch sizes (e.g., updating fast weights every
              16 or 64 tokens).  Moreover, a small mini-batch implies
              fine-grained block-wise causal dependencies in the data, making
              them unsuitable for data beyond 1D ordered sequences, like sets or
              N-dimensional grids such as images or videos.  In contrast, we
              pursue the opposite direction by proposing an extremely large
              chunk update, ranging from 2K to 1M tokens across tasks of varying
              modalities, which we refer to as Large Chunk Test-Time Training
              (LaCT).  This approach improves hardware utilization by orders of
              magnitude, and more importantly, facilitates scaling of nonlinear
              state size (up to 40% of model parameter size), hence
              substantially improving state capacity, all without requiring
              cumbersome and error-prone custom kernel implementations.  It also
              allows easy integration of sophisticated optimizers like Muon for
              online memory updates.  We validate our approach across diverse
              data modalities and tasks, including novel view synthesis from
              image sets, language models, and auto-regressive video diffusion
              models.  Our approach can scale up to 14-billion-parameter
              auto-regressive video diffusion models handling sequences of up to
              56K tokens.  In our longest-sequence experiment, we perform novel
              view synthesis with a context length of more than one million tokens. Our
              results highlight the computational and performance benefits of
              large-chunk test-time training, paving the way for more efficient
              and scalable long-context sequence modeling. We hope that this
              work will inspire and accelerate new research in the field of
              long-context modeling and test-time training. 
              
              <br><br>
            </p>
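            <p>
              To make the chunk-wise update concrete, the following is a minimal
              sketch rather than the implementation used in this work. It assumes
              a linear fast weight W, a single plain gradient step per chunk on a
              negative dot-product loss, and an apply-then-update order; the
              function names and the learning rate are hypothetical, and the
              window attention that handles structure inside a chunk is omitted.
            </p>
<pre><code>
```python
def outer_sum(values, keys):
    """Sum of outer products v k^T over all tokens in one chunk."""
    d_out, d_in = len(values[0]), len(keys[0])
    grad = [[0.0] * d_in for _ in range(d_out)]
    for v, k in zip(values, keys):
        for r in range(d_out):
            for c in range(d_in):
                grad[r][c] += v[r] * k[c]
    return grad


def apply_fast_weight(W, queries):
    """Read from the memory: compute W q for every query in the chunk."""
    return [[sum(w * x for w, x in zip(row, q)) for row in W] for q in queries]


def lact_layer(chunks, dim, lr=1.0):
    """Process (keys, values, queries) chunks with one fast-weight update each.

    Assumption: the loss is -sum_i v_i . (W k_i), so a gradient-descent step
    adds lr * sum_i v_i k_i^T to W. A real layer may use a nonlinear fast
    weight and a more sophisticated optimizer.
    """
    W = [[0.0] * dim for _ in range(dim)]
    outputs = []
    for keys, values, queries in chunks:
        # Apply: queries only see what earlier chunks wrote into W.
        outputs.append(apply_fast_weight(W, queries))
        # Update: one step for the whole chunk, not one step per token.
        grad = outer_sum(values, keys)
        W = [[w + lr * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad)]
    return outputs, W


# Chunk 1 stores two key/value pairs; chunk 2 queries the first key.
chunk1 = ([[1.0, 0.0], [0.0, 1.0]], [[2.0, 3.0], [5.0, 7.0]], [[1.0, 0.0]])
chunk2 = ([[1.0, 0.0]], [[0.0, 0.0]], [[1.0, 0.0]])
outputs, W = lact_layer([chunk1, chunk2], dim=2)
# outputs[0] is all zeros (empty memory); outputs[1] recovers [2.0, 3.0].
```
</code></pre>
            <p>
              Because the whole chunk contributes to a single update, the work
              per chunk is dominated by large matrix products, which is the
              property the abstract credits for the improved hardware utilization.
            </p>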
          </div>
        </div>
      </div>
      <img src="./static/images/fig_utilization.png" alt="Hardware Utilization">
      <p class="has-text-justified"><strong>Figure 1</strong>.
        Using larger chunk sizes significantly improves GPU utilization compared
        to the original test-time training (TTT) method, even though that method
        uses customized kernels <b>(a)</b>. This enhanced utilization enables
        efficient scaling to larger state sizes <b>(b)</b>, resulting in
        improved training efficiency and better overall performance
        <b>(c)</b>. The dotted line in <b>(a)</b> is the theoretical peak
        BF16 throughput of the GPU. 
      </p>
    </div>
  </section>


  <section class="hero teaser">
    <div class="container is-max-desktop">
      <h2 class="title is-3">
        <center>Results on Novel View Synthesis (GSO Dataset)</center>
      </h2>

      <h3 class="subtitle is-4">
          <center><a href="./gso_results.html">(Click to see more results)</a></center>
      </h3>
      <center><p>Compressing 262K tokens (16 images at 1024x1024 resolution) into fast weights online.</p></center>
      <center><p>Rendering at 1024x1024 resolution, best viewed in full screen.</p></center>
      <div class="hero-body">
        <div class="columns is-1 is-multiline is-mobile">
          <div class="column pb-0 mb-0 is-half">
            <video id="teaser" autoplay muted loop playsinline controls  height="100%">
              <source
                src="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/gso_results/00000351_turntable_with_input.mp4"
                type="video/mp4">
            </video>
          </div>

          <div class="column pb-0 mb-0 is-half">
            <video id="teaser" autoplay muted loop playsinline controls  height="100%">
              <source
                src="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/gso_results/00001007_turntable_with_input.mp4"
                type="video/mp4">
            </video>
          </div>
        </div>


        <div class="columns is-1 is-multiline is-mobile">
          <div class="column pb-0 mb-0 is-half">
            <video id="teaser" autoplay muted loop playsinline controls  height="100%">
              <source
                src="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/gso_results/00000758_turntable_with_input.mp4"
                type="video/mp4">
            </video>
          </div>

          <div class="column pb-0 mb-0 is-half">
            <video id="teaser" autoplay muted loop playsinline controls  height="100%">
              <source
                src="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/gso_results/00000547_turntable_with_input.mp4"
                type="video/mp4">
            </video>
          </div>
        </div>


      </div>
    </div>
  </section>


  <section class="hero teaser">
    <div class="container is-max-desktop">
      <h2 class="title is-3">
        <center>Results on Novel View Synthesis (DL3DV Dataset)</center>
      </h2>

      <h3 class="subtitle is-4">
          <center><a href="./dl3dv_results.html">(Click to see more results)</a></center>
      </h3>

      <center><p>Compressing 1 million tokens (128 images at 960x536 resolution) into fast weights online.</p></center>
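      <p class="has-text-justified">
        As a rough check on the token counts quoted on this page, the sketch
        below assumes each image is split into non-overlapping 8x8 pixel
        patches with one token per patch. The patch size is an assumption
        inferred from the quoted numbers, not a value stated here, and
        <code>token_count</code> is a hypothetical helper.
      </p>
<pre><code>
```python
def token_count(num_images, width, height, patch=8):
    """Patch tokens for a set of images, assuming patch x patch tiling."""
    return num_images * (width // patch) * (height // patch)


gso_tokens = token_count(16, 1024, 1024)   # 262144, i.e. about 262K
dl3dv_tokens = token_count(128, 960, 536)  # 1029120, i.e. about 1 million
```
</code></pre>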

      <br>

      
    

      <div class="hero-body">
        <div class="columns is-1 is-multiline is-mobile">
          <div class="column pb-0 mb-0 is-half">
            <video id="teaser" autoplay muted loop playsinline controls  height="100%">
              <source
                src="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/dl3dv_results_final/00000006_turntable_with_input.mp4"
                type="video/mp4">
            </video>
            <center>
              <a href="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/dl3dv_results_final/00000006_input.png"> Input images</a>
            </center>
          </div>

          <div class="column pb-0 mb-0 is-half">
            <video id="teaser" autoplay muted loop playsinline controls  height="100%">
              <source
                src="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/dl3dv_results_final/00000108_turntable_with_input.mp4"
                type="video/mp4">
            </video>

            <center>
              <a href="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/dl3dv_results_final/00000108_input.png"> Input images</a>
            </center>
          </div>
        </div>


        <div class="columns is-1 is-multiline is-mobile">
          <div class="column pb-0 mb-0 is-half">
            <video id="teaser" autoplay muted loop playsinline controls  height="100%">
              <source
                src="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/dl3dv_results_final/00000076_turntable_with_input.mp4"
                type="video/mp4">
            </video>
            <center>
              <a href="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/dl3dv_results_final/00000076_input.png"> Input images</a>
            </center>
          </div>

          <div class="column pb-0 mb-0 is-half">
            <video id="teaser" autoplay muted loop playsinline controls  height="100%">
              <source
                src="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/dl3dv_results_final/00000035_turntable_with_input.mp4"
                type="video/mp4">
            </video>
            <center>
              <a href="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/dl3dv_results_final/00000035_input.png"> Input images</a>
            </center>
          </div>
        </div>



      </div>



    </div>
  </section>



  <section class="hero teaser">
    <div class="container is-max-desktop">
      <h2 class="title is-3">
        <center>Results on Language Modeling</center>
      </h2>
        <img src="./static/images/lm_result_v2.png" alt="Language Modeling">

        <p class="has-text-justified"><strong>Figure 5</strong>.
          Language modeling results.
          <b>(a, c)</b> Our model achieves lower per-position loss at larger
          token indices than GLA and DeltaNet at both the 760M and 3B scales,
          indicating stronger long-context modeling capability.
          <b>(b, d)</b> Our model consistently outperforms GLA and DeltaNet in
          retrieval accuracy. Furthermore, our Muon variant consistently outperforms
          our Momentum variant.
          <b>(e)</b> Validation loss on the final 2K tokens improves consistently as
          the attention state size increases from 0.25x to 3x d<sup>2</sup>, where d
          denotes the model's feature dimension.
        </p>
        <br>
    </div>
  </section>


  <section class="hero teaser">
    <div class="container is-max-desktop">
      <h2 class="title is-3">
        <center>Results on Autoregressive Video Diffusion</center>
      </h2>


      <h3 class="subtitle is-4">
          <center><a href="./ar_video_results.html">(Click to see more results)</a></center>
      </h3>


      <div class="hero-body">
        <div class="columns is-1 is-multiline is-mobile">
          <div class="column pb-0 mb-0 is-half">
            <video id="teaser" autoplay muted loop playsinline controls  height="100%">
              <source
                src="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/ar_video_results_1/output_006.mp4"
                type="video/mp4">
            </video>
            <p class="has-text-justified">
              This close-up shot of a Victoria crowned pigeon showcases its striking blue plumage and red chest. Its crest is made of delicate, lacy feathers, while its eye is a striking red color. The bird's head is tilted slightly to the side, giving the impression of it looking regal and majestic. The background is blurred, drawing attention to the bird's striking appearance.
            </p>
          </div>

          <div class="column pb-0 mb-0 is-half">
            <video id="teaser" autoplay muted loop playsinline controls  height="100%">
              <source
                src="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/ar_video_results_1/output_005.mp4"
                type="video/mp4">
            </video>
            <p class="has-text-justified">A gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures.</p>
          </div>
        </div>


        <div class="columns is-1 is-multiline is-mobile">
          <div class="column pb-0 mb-0 is-half">
            <video id="teaser" autoplay muted loop playsinline controls  height="100%">
              <source
                src="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/ar_video_results_1/output_060.mp4"
                type="video/mp4">
            </video>
            <p class="has-text-justified">
              An astronaut runs on the surface of the moon, the low angle shot shows the vast background of the moon, the movement is smooth and appears lightweight.
            </p>
          </div>

          <div class="column pb-0 mb-0 is-half">
            <video id="teaser" autoplay muted loop playsinline controls  height="100%">
              <source
                src="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/ar_video_results_1/output_064.mp4"
                type="video/mp4">
            </video>
            <p class="has-text-justified">
              A man riding a horse through the Gobi Desert with a beautiful sunset behind him, movie quality.
            </p>
          </div>
        </div>


        <div class="columns is-1 is-multiline is-mobile">
          <div class="column pb-0 mb-0 is-half">
            <video id="teaser" autoplay muted loop playsinline controls  height="100%">
              <source
                src="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/ar_videos_8secs/009.mp4"
                type="video/mp4">
            </video>
            <p class="has-text-justified">
              A woman singing and standing in a concert stage with a bright light in the background.
            </p>
          </div>

          <div class="column pb-0 mb-0 is-half">
            <video id="teaser" autoplay muted loop playsinline controls  height="100%">
              <source
                src="https://huggingface.co/datasets/tempdatasest/ttt_dataset/resolve/main/ar_videos_8secs/013.mp4"
                type="video/mp4">
            </video>
            <p class="has-text-justified">
              A cyclist powering up a steep hill in a road race.
            </p>
          </div>
        </div>


      </div>
    </div>
  </section>

  <footer class="footer">
    <div class="container">
      <div class="columns is-centered">
        <div class="column is-8">
          <div class="content">
            <p>
              The source code of this website is adapted from the <a
                href="https://github.com/nerfies/nerfies.github.io">Nerfies project page</a>.
            </p>
          </div>
        </div>
      </div>
    </div>
  </footer>

</body>

</html>