iLRM: An Iterative Large 3D Reconstruction Model

ICLR 2026 Conference Submission 15822 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Feed-forward 3D reconstruction, 3D Gaussian Splatting, Multi-view geometry
TL;DR: iLRM is a scalable feed-forward 3D reconstruction model that overcomes the inefficiency of prior Gaussian-based approaches when handling many views or high-resolution inputs.
Abstract: Feed-forward 3D modeling has emerged as a promising approach for rapid, high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast, high-quality rendering and broad applicability. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues: they rely on full attention across the image tokens of all input views, so computational cost grows prohibitively as the number of views or the image resolution increases. Toward scalable and efficient feed-forward 3D reconstruction, we introduce the iterative Large 3D Reconstruction Model (*iLRM*), which generates 3D Gaussian representations through an iterative refinement mechanism guided by three core principles: (1) decoupling the scene representation from the input-view images to enable *compact 3D representations*; (2) decomposing fully-attentional multi-view interactions into a *two-stage attention* scheme to reduce computational cost; and (3) injecting *high-resolution information at every layer* to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed.
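The two-stage attention scheme named in the abstract can be illustrated with a minimal PyTorch sketch, assuming one plausible layer design: learnable scene tokens first cross-attend to each view's image tokens independently (stage 1), then self-attend among themselves (stage 2), so attention is never computed over the full set of `V * N` image tokens at once. All names (`TwoStageAttentionBlock`, `scene`, `views`), dimensions, and the mean aggregation over views are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TwoStageAttentionBlock(nn.Module):
    """Hypothetical iLRM-style layer: scene tokens cross-attend to each
    view separately, then self-attend among themselves. Illustrative only."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, scene: torch.Tensor, views: torch.Tensor) -> torch.Tensor:
        # scene: (B, S, D) learnable scene tokens, decoupled from input views.
        # views: (B, V, N, D) per-view image tokens; re-feeding these at every
        # layer is one way to "inject high-resolution information".
        B, V, N, D = views.shape
        # Stage 1: cross-attend to each view independently, then average
        # (assumed aggregation) -- cost O(V * S * N) instead of O((V * N)^2).
        q = self.norm1(scene)
        per_view = []
        for v in range(V):
            out, _ = self.cross_attn(q, views[:, v], views[:, v])
            per_view.append(out)
        scene = scene + torch.stack(per_view, dim=0).mean(dim=0)
        # Stage 2: self-attention among the compact scene tokens only, O(S^2).
        h = self.norm2(scene)
        out, _ = self.self_attn(h, h, h)
        scene = scene + out
        # Feed-forward update of the refined scene tokens.
        return scene + self.mlp(self.norm3(scene))


# Usage sketch: 512 scene tokens attending to 8 views of 1024 tokens each.
block = TwoStageAttentionBlock()
scene = torch.randn(2, 512, 256)
views = torch.randn(2, 8, 1024, 256)
scene = block(scene, views)  # iterate such blocks to refine the scene tokens
```

Under these assumptions, stage 1 costs O(V·S·N) and stage 2 costs O(S²), compared with O((V·N)²) for full attention over all image tokens, which is consistent with the scalability gain the abstract describes.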
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15822