Native Spatio-Temporal 4D Variational Autoencoder

Lihe Ding; Weicai Ye; Shaocong Dong; Xintao Wang; Pengfei Wan; Kun Gai; Tianfan Xue


Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | CC BY 4.0

TL;DR: We introduce a novel 4D VAE that operates directly in native 4D space (dynamic colored voxels), preserving explicit spatio-temporal coordinates throughout encoding and decoding.

Abstract: Dynamic 3D content representation is crucial for generating moving 3D objects and scenes. Existing 4D variational autoencoders (VAEs) are mainly based on projected 2D pointmaps, which are incomplete, view-dependent observations that do not model the native 4D positional relations between points. This often leads to projection-induced distortions and irreversible token dislocation. In this paper, we introduce a novel 4D VAE that operates directly in native 4D space, namely a dynamic colored voxel space, without 2D projection. This preserves explicit spatio-temporal coordinates throughout the learned encoder and decoder, enabling both partial and complete 4D content encoding. To support a flexible temporal compression ratio, we also design a novel spatio-temporal window attention module that performs attention within local 4D windows. Additionally, we propose a differentiable voxel rendering loss based on sparse voxel rasterization to improve geometry and color reconstruction quality. On 4D reconstruction tasks, our approach improves reconstruction fidelity over pointmap VAEs and flow-based VAEs while learning a more structurally consistent latent space. We further demonstrate the generative potential of our method by training a video-conditioned 4D diffusion model.
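
To make the idea of attention within local 4D windows concrete, here is a minimal NumPy sketch. It is not the authors' module: the function names `window_ids` and `window_attention`, the window extent `(2, 4, 4, 4)`, and the use of plain single-head attention without learned projections are all assumptions made for illustration. It only shows how sparse voxels with explicit (t, x, y, z) coordinates might be grouped into windows, with each voxel attending only to others in its own window.

```python
import numpy as np


def window_ids(coords, window=(2, 4, 4, 4)):
    """Assign each (t, x, y, z) voxel to a local 4D window.

    coords: (N, 4) array of non-negative integer voxel coordinates.
    window: hypothetical window extent along (t, x, y, z).
    Returns an (N,) array of compact window ids; voxels that share
    an id lie in the same 4D window.
    """
    coords = np.asarray(coords, dtype=np.int64)
    win = np.asarray(window, dtype=np.int64)
    cells = coords // win  # window cell of each voxel, shape (N, 4)
    # Each distinct row of `cells` gets one compact integer id.
    _, ids = np.unique(cells, axis=0, return_inverse=True)
    return ids.reshape(-1)


def window_attention(feats, ids):
    """Softmax self-attention restricted to voxels in the same window.

    feats: (N, C) voxel features. No learned query/key/value
    projections are used here (a simplifying assumption).
    """
    feats = np.asarray(feats, dtype=np.float64)
    out = np.empty_like(feats)
    scale = 1.0 / np.sqrt(feats.shape[1])
    for w in np.unique(ids):
        idx = np.nonzero(ids == w)[0]
        x = feats[idx]
        logits = (x @ x.T) * scale
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        weights = np.exp(logits)
        weights /= weights.sum(axis=1, keepdims=True)
        out[idx] = weights @ x
    return out


# Four occupied voxels; the first two fall in the same 4D window.
coords = np.array([[0, 0, 0, 0], [1, 3, 3, 3], [2, 0, 0, 0], [0, 4, 0, 0]])
feats = np.arange(8, dtype=np.float64).reshape(4, 2)
ids = window_ids(coords)
out = window_attention(feats, ids)
```

In this sketch, changing the first entry of `window` changes how many frames share a window, which is one plausible way a window-based design could interact with the temporal compression ratio; the abstract does not specify the actual mechanism.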

Lay Summary: 3D content is no longer limited to static shapes: in many applications, such as animation, games, virtual reality, robotics, and simulation, we need 3D scenes and objects that can move, deform, and change over time. However, many existing methods represent dynamic 3D content by first projecting it into 2D views, similar to taking pictures of a moving object from certain angles. This can lose important 3D and time-related information, especially when parts of the object are hidden or distorted by the projection. In this work, we propose a new way to represent dynamic 3D content directly in its original 4D space, where both 3D position and time are preserved. Instead of relying on projected 2D observations, our method works with dynamic colored voxels, which can be understood as small 3D blocks that also change over time. This allows the model to better capture the structure, motion, geometry, and color of moving 3D content. Our experiments show that this direct 4D representation reconstructs dynamic objects more accurately than previous methods and also provides a promising foundation for generating new 4D content from videos.


Primary Area: Applications->Computer Vision

Keywords: VAE, 4D, Diffusion


Submission Number: 5630
