Focusing: View-Consistent Sparse Voxels for Efficient 3D VAE

ICLR 2026 Conference Submission17507 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: 3D AIGC, 3D VAE
Abstract: High-fidelity 3D generation remains difficult. Many methods convert raw meshes to signed distance functions (SDFs) for supervision, but this conversion is lossy. TripoSF introduced a VAE training paradigm based on a rendering loss to circumvent the lossy SDF conversion, achieving high-precision surface reconstruction. However, because a rendering loss cannot supervise all VAE outputs the way dense SDF supervision does, this approach limits detail and scalability. We present **Focusing**, a 3D VAE that improves efficiency by activating only the voxels that matter for a given view. Our key idea is a depth-driven voxel carving performed in the structured latent space: voxels inconsistent with the rendered depth are pruned before decoding. This concentrates learning on locally relevant geometry, reduces attention and decoding costs, and lowers GPU memory (VRAM) usage. To stabilize training and capture fine details, we further introduce an adaptive zooming strategy that adjusts the camera intrinsics to keep the number of active voxels within a target range. The VAE is trained with a render-based loss on depth, normals, masks, and perceptual terms, and we add simple regularizers (e.g., a sparse-voxel total-variation term and a short warm-up with TSDF supervision) to reduce small holes and speed up convergence. Across standard reconstruction benchmarks, Focusing improves geometric accuracy (Chamfer distance, F-score) over strong baselines while cutting VRAM consumption, allowing the $1024^3$-resolution VAE to be trained with as little as 50 GB of VRAM. These results show that local, view-consistent sparsity is an effective route to higher-resolution, more efficient 3D VAEs.
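To make the two mechanisms in the abstract concrete, below is a minimal NumPy sketch, not the paper's implementation: the names `carve_voxels` and `adapt_intrinsics`, the tolerance `band`, and the active-voxel target range are all hypothetical choices for illustration, and the carving is shown on world-space voxel centers rather than on the structured latent tokens the paper operates on.

```python
# Hypothetical sketch of depth-driven voxel carving and adaptive zooming.
# Function names, the tolerance `band`, and the target range are assumptions.
import numpy as np

def carve_voxels(voxel_centers, K, R, t, depth_map, band=0.05):
    """Return a boolean keep-mask: voxels consistent with the rendered depth.

    voxel_centers: (N, 3) world-space centers of active sparse voxels.
    K: (3, 3) intrinsics; R, t: world-to-camera rotation/translation.
    depth_map: (H, W) rendered depth for this view.
    """
    H, W = depth_map.shape
    cam = voxel_centers @ R.T + t                   # world -> camera space
    z = np.clip(cam[:, 2], 1e-6, None)
    uv = cam @ K.T                                  # homogeneous pixel coords
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    in_view = (cam[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    keep = np.zeros(len(voxel_centers), dtype=bool)
    d_obs = depth_map[v[in_view], u[in_view]]
    # Keep only voxels within a band around the observed surface; the rest are
    # pruned before decoding, so attention and decoding run on fewer tokens.
    # (Treating out-of-frustum voxels as pruned is a simplification here.)
    keep[in_view] = np.abs(cam[in_view, 2] - d_obs) <= band
    return keep

def adapt_intrinsics(K, n_active, target=(20_000, 40_000), step=1.1):
    """Scale the focal length so the active-voxel count stays within `target`."""
    K = K.copy()
    if n_active > target[1]:
        # Too many active voxels: zoom in (longer focal length, narrower FoV).
        K[0, 0] *= step
        K[1, 1] *= step
    elif n_active < target[0]:
        # Too few active voxels: zoom out to cover more of the object.
        K[0, 0] /= step
        K[1, 1] /= step
    return K
```

In this reading, zooming in shrinks the view frustum and hence the active set, while zooming out grows it; the paper's actual carving criterion and zooming policy may differ from this heuristic.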
Primary Area: generative models
Submission Number: 17507