Occluded 3D Object Reconstruction via Masked Multi-view Volumetric Transformer

17 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Occlusion, 3D Reconstruction, Amodal Completion
Abstract: Recent advances in image-to-3D reconstruction and generation have yielded remarkable progress. However, applying these methods to occluded objects in cluttered scenes remains challenging because information in the occluded regions is incomplete. To tackle this issue, we present a feed-forward, data-driven method for reconstructing the 3D shapes of occluded objects. Our method leverages a large-scale dataset of cluttered scenes and performs multi-view, occlusion-aware 3D reconstruction with a Transformer architecture inspired by masked autoencoders. Our model, the \textit{Masked Multi-view Volumetric Transformer}, reasons globally over an arbitrary number of multi-view 2D images and applies cross-attention between 3D-lifted obstacle mask volumes and volumetric latents, enabling accurate prediction in occluded regions. Furthermore, we have created a synthetic cluttered-scene dataset comprising $\sim$30,000 scenes built from Objaverse objects, designed to cover diverse occlusion scenarios. Our approach surpasses previous methods in predicting complete shapes from occluded images of unseen objects, extracting the completed mesh in five seconds.
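The paper does not release reference code here, but the core mechanism the abstract names (cross-attention where flattened volumetric latent tokens query tokens from a 3D-lifted occlusion-mask volume) can be sketched as plain scaled dot-product attention. Everything below is a hypothetical illustration: the token counts, dimensions, and the absence of learned projection weights are assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, no learned
    projections): volumetric latent tokens attend over mask-volume tokens."""
    d_k = queries.shape[-1]
    scores = queries @ keys.transpose(0, 2, 1) / np.sqrt(d_k)  # (B, Nq, Nk)
    weights = softmax(scores, axis=-1)                          # rows sum to 1
    return weights @ values                                     # (B, Nq, d)

# Hypothetical sizes: an 8x8x8 latent volume and mask volume,
# each flattened to 512 tokens of dimension 64.
rng = np.random.default_rng(0)
latents = rng.normal(size=(1, 8**3, 64))      # volumetric latent tokens
mask_tokens = rng.normal(size=(1, 8**3, 64))  # 3D-lifted occlusion-mask tokens
out = cross_attention(latents, mask_tokens, mask_tokens)
print(out.shape)  # (1, 512, 64)
```

In the actual model, learned linear projections would map latents to queries and mask-volume features to keys/values before this step; the sketch only shows the attention pattern that lets occluded latent cells gather evidence from the mask volume.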
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8763