Multi-view 3D Reconstruction from Video with Transformer

Published: 01 Jan 2022 · Last Modified: 04 Nov 2024 · ICIP 2022 · CC BY-SA 4.0
Abstract: Multi-view 3D reconstruction underpins many other applications in computer vision. Video provides multi-view images together with temporal information, which can help accomplish the reconstruction goal. Handling redundant information in video, and extracting and fusing multi-view features, are the key issues in extracting shape priors for reconstruction. In this paper, inspired by the recent success of Transformer models, we propose a transformer-based 3D reconstruction network. We formulate multi-view 3D reconstruction as three parts: a frame encoder, a fusion module, and a shape decoder. We introduce several special tokens and perform fusion progressively in the encoder phase, which we call a patch-level progressive fusion module. These tokens indicate which part of the object each frame should focus on and progressively capture local structural detail. We then design a transformer fusion module to aggregate the structural information. Finally, multi-head attention is used to build a transformer-based decoder that reuses shallow features from the encoder. In experiments, our method not only achieves competitive performance but also has low model complexity and computation cost.
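The abstract describes fusing patch tokens from multiple video frames via multi-head attention, with special tokens that aggregate structural information across views. The paper itself does not provide code, so the following is only a minimal NumPy sketch of that general idea, assuming a single hypothetical "shape token" attending over patch tokens pooled from all views; the shapes, head count, and token names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads):
    """Scaled dot-product attention with the feature dim split across heads.

    q: (n_q, dim), k/v: (n_kv, dim); returns (n_q, dim).
    """
    dim = q.shape[-1]
    head_dim = dim // num_heads
    out = np.empty_like(q)
    for h in range(num_heads):
        sl = slice(h * head_dim, (h + 1) * head_dim)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(head_dim)
        out[:, sl] = softmax(scores) @ v[:, sl]
    return out

# Hypothetical setup: 4 views, 16 patch tokens per view, 32-dim features.
rng = np.random.default_rng(0)
num_views, patches_per_view, dim = 4, 16, 32
patch_tokens = rng.normal(size=(num_views * patches_per_view, dim))

# A single learned "shape token" (assumed name) queries all patch tokens,
# fusing cross-view structure into one descriptor.
shape_token = rng.normal(size=(1, dim))
fused = multi_head_attention(shape_token, patch_tokens, patch_tokens, num_heads=4)
print(fused.shape)  # (1, 32)
```

In the paper's actual model the projections for queries, keys, and values are learned, fusion happens progressively inside the encoder rather than in one step, and the decoder reuses shallow encoder features; the sketch only illustrates the attention-based aggregation primitive those components share.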
