Multi-view 3D Reconstruction from Video with Transformer

Published: 01 Jan 2022 · Last Modified: 04 Nov 2024 · ICIP 2022 · CC BY-SA 4.0
Abstract: Multi-view 3D reconstruction underpins many other applications in computer vision. Video provides multi-view images together with temporal information, which can help accomplish the reconstruction goal. Handling redundant information in video, and extracting and fusing multi-view features, are the key issues in extracting shape priors for reconstruction. In this paper, inspired by the recent success of Transformer models, we propose a transformer-based 3D reconstruction network. We formulate multi-view 3D reconstruction as three parts: a frame encoder, a fusion module, and a shape decoder. We introduce several special tokens and perform fusion progressively in the encoder phase, which we call a patch-level progressive fusion module. These tokens indicate which part of the object each frame should focus on and progressively capture local structural detail. We then design a transformer fusion module to aggregate the structural information. Finally, multi-head attention is used to build a transformer-based decoder that reuses shallow features from the encoder. In experiments, our method not only achieves competitive performance but also has low model complexity and computation cost.
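The abstract describes fusing patch tokens from multiple video frames via multi-head attention, with special tokens that aggregate structural information across views. The paper itself does not provide code, so the following is only a minimal NumPy sketch of that general idea, assuming a single hypothetical "shape token" attending over patch tokens pooled from all views; the shapes, head count, and token names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads):
    """Scaled dot-product attention with the feature dim split across heads.

    q: (n_q, dim), k/v: (n_kv, dim); returns (n_q, dim).
    """
    dim = q.shape[-1]
    head_dim = dim // num_heads
    out = np.empty_like(q)
    for h in range(num_heads):
        sl = slice(h * head_dim, (h + 1) * head_dim)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(head_dim)
        out[:, sl] = softmax(scores) @ v[:, sl]
    return out

# Hypothetical setup: 4 views, 16 patch tokens per view, 32-dim features.
rng = np.random.default_rng(0)
num_views, patches_per_view, dim = 4, 16, 32
patch_tokens = rng.normal(size=(num_views * patches_per_view, dim))

# A single learned "shape token" (assumed name) queries all patch tokens,
# fusing cross-view structure into one descriptor.
shape_token = rng.normal(size=(1, dim))
fused = multi_head_attention(shape_token, patch_tokens, patch_tokens, num_heads=4)
print(fused.shape)  # (1, 32)
```

In the paper's actual model the projections for queries, keys, and values are learned, fusion happens progressively inside the encoder rather than in one step, and the decoder reuses shallow encoder features; the sketch only illustrates the attention-based aggregation primitive those components share.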
