TL;DR: We introduce a novel 4D VAE that operates directly in native 4D space (dynamic colored voxels), preserving explicit spatio-temporal coordinates throughout encoding and decoding.
Abstract: Dynamic 3D content representation is crucial for generating moving 3D objects and scenes. Existing 4D variational autoencoders (VAEs) are mainly based on projected 2D pointmaps, which are incomplete, view-dependent observations that do not model the native 4D positional relations between points. This often leads to projection-induced distortions and irreversible token dislocation. In this paper, we introduce a novel 4D VAE that operates directly in native 4D space, that is, dynamic colored voxel space, without 2D projection. This preserves explicit spatio-temporal coordinates throughout the learned encoder and decoder, enabling both partial and complete 4D content encoding. To support a flexible temporal compression ratio, we also design a novel spatio-temporal window attention module that performs attention within local 4D windows. Additionally, we propose a differentiable voxel rendering loss based on sparse voxel rasterization to improve the geometry and color reconstruction quality. On 4D reconstruction tasks, our approach improves reconstruction fidelity over pointmap VAEs and flow-based VAEs while learning a more structurally consistent latent space. We further demonstrate the generative potential of our method by training a video-conditioned 4D diffusion model.
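As a rough illustration of what "attention within local 4D windows" implies for sparse voxel data, the sketch below groups occupied (t, x, y, z) voxel coordinates by the window they fall into, so that attention could then be restricted to each group. The function name, the window extents, and the coordinate layout are all assumptions made for illustration; none of them is taken from the paper, and the actual module may partition tokens differently.

```python
from collections import defaultdict


def partition_into_4d_windows(coords, window=(2, 4, 4, 4)):
    """Group sparse (t, x, y, z) voxel coordinates by local 4D window.

    Hypothetical helper: the window extents are illustrative assumptions,
    not values reported by the authors.
    """
    groups = defaultdict(list)
    for index, coord in enumerate(coords):
        # Integer-divide each axis by its window extent to get the window id.
        key = tuple(c // w for c, w in zip(coord, window))
        groups[key].append(index)
    return dict(groups)


# Four occupied voxels spread over three frames.
voxels = [(0, 0, 0, 0), (1, 3, 3, 3), (2, 0, 0, 0), (0, 4, 0, 0)]
windows = partition_into_4d_windows(voxels)
```

In this sketch, enlarging the first entry of `window` places more frames in the same group, while the spatial extents are left unchanged.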
Lay Summary: 3D content is no longer limited to static shapes: in many applications, such as animation, games, virtual reality, robotics, and simulation, we need 3D scenes and objects that can move, deform, and change over time.
However, many existing methods represent dynamic 3D content by first projecting it into 2D views, similar to taking pictures of a moving object from certain angles. This can lose important 3D and time-related information, especially when parts of the object are hidden or distorted by the projection.
In this work, we propose a new way to represent dynamic 3D content directly in its original 4D space, where both 3D position and time are preserved. Instead of relying on projected 2D observations, our method works with dynamic colored voxels, which can be understood as small 3D blocks that also change over time. This allows the model to better capture the structure, motion, geometry, and color of moving 3D content.
Our experiments show that this direct 4D representation reconstructs dynamic objects more accurately than previous methods and also provides a promising foundation for generating new 4D content from videos.
Originally Submitted Supplementary Material: pdf
Primary Area: Applications->Computer Vision
Keywords: VAE, 4D, Diffusion
Originally Submitted PDF: pdf
Submission Number: 5630