VAT: Visibility Aware Transformer for Fine-Grained Clothed Human Reconstruction

Xiao-Yan Zhang, Zibin Zhu, Hong Xie, Sisi Ren, Jianmin Jiang

Published: 2025, Last Modified: 02 Apr 2026. IEEE Trans. Vis. Comput. Graph. 2025. CC BY-SA 4.0

Abstract: To reconstruct 3D clothed humans with accurate fine-grained details from sparse views, we propose a deeply cooperative two-level, global-to-fine-grained reconstruction framework that constructs robust global geometry to guide fine-grained geometry learning. The core of the framework is a novel Visibility Aware Transformer (VAT), which bridges the two-level reconstruction architecture by connecting its global encoder and fine-grained decoder to two pixel-aligned implicit functions, respectively. The global encoder fuses semantic features from multiple views to integrate global geometric features. In the fine-grained decoder, a visibility-aware attention mechanism efficiently fuses multi-view and multi-scale features to mine fine-grained geometric features. A global embedding module connects the global encoder and the fine-grained decoder, forming a deep cooperation within the two-level framework: it provides the global geometric embedding as query guidance for computing visibility-aware attention in the fine-grained decoder. In addition, to extract highly aligned multi-scale features for the two-level reconstruction architecture, we design an image feature extractor, MSUNet, which establishes strong semantic connections between different scales at minimal cost. The proposed framework is end-to-end trainable, with all modules jointly optimized. We validate the framework on public benchmarks, and experimental results demonstrate that our method has significant advantages over state-of-the-art methods in both fine-grained performance and generalization.
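
The abstract does not give the attention formula, so the following is only an illustrative sketch of one plausible reading of query-guided, visibility-aware fusion across views; it is not the authors' implementation. Everything in it is an assumption: the function name `visibility_aware_attention`, the use of the raw per-view feature as both key and value (no learned projections), and the additive log-visibility bias on the attention logits. The global geometric embedding of a 3D query point plays the role of the query, and views in which the point is not visible are down-weighted.

```python
import math

def visibility_aware_attention(query, view_feats, visibility, eps=1e-9):
    """Fuse per-view features for one 3D point (hypothetical sketch).

    query      -- global geometric embedding of the point, length d (assumed)
    view_feats -- one pixel-aligned feature vector per view, each length d
    visibility -- per-view visibility score in [0, 1] (assumed input)
    Returns (fused_feature, attention_weights).
    """
    d = len(query)
    scale = math.sqrt(d)

    # Scaled dot-product logit per view, biased by log-visibility so that
    # occluded views (visibility near 0) receive a large negative logit.
    logits = []
    for feat, vis in zip(view_feats, visibility):
        dot = sum(q * f for q, f in zip(query, feat))
        logits.append(dot / scale + math.log(vis + eps))

    # Numerically stable softmax over the views.
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the per-view features.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, view_feats))
        for i in range(d)
    ]
    return fused, weights

# Example: three views, the third one occluded for this point.
fused, weights = visibility_aware_attention(
    query=[1.0, 0.0],
    view_feats=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    visibility=[1.0, 1.0, 0.0],
)
```

In this toy setup the occluded third view contributes almost nothing to the fused feature, and the first view, whose feature best matches the query, receives the largest weight. A real model would add learned query, key, and value projections and multi-scale features, which are omitted here.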