TL;DR: A two-stage pipeline for generating high-quality 3D assets in a feed-forward manner.
Abstract: Generating high-quality 3D content from text, single images, or sparse view images remains a challenging task with broad applications.
Existing methods typically employ multi-view diffusion models to synthesize multi-view images, followed by a feed-forward process for 3D reconstruction. However, these approaches are often constrained by a small, fixed number of input views, which limits their ability to capture diverse viewpoints and, worse, yields suboptimal generation results when the synthesized views are of poor quality.
To address these limitations, we propose Flex3D, a novel two-stage framework capable of leveraging an arbitrary number of high-quality input views. The first stage consists of a candidate view generation and curation pipeline. In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs. Through extensive exploration of design and training strategies, we optimize FlexRM to achieve superior performance in both reconstruction and generation tasks. Our results demonstrate that Flex3D achieves state-of-the-art performance, with a user study winning rate of over 92% in 3D generation tasks when compared to several of the latest feed-forward 3D generative models.
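To make the second stage concrete, below is a minimal sketch (in PyTorch, and not the authors' code) of how a transformer-based reconstruction model can accept an arbitrary number of input views: each view is patchified into tokens, the tokens from all views are concatenated into a single sequence, and self-attention operates jointly over that sequence, so the view count is free to vary. The dimensions, the `VariableViewReconstructor` name, and the pooled output head are illustrative assumptions; the actual FlexRM decodes a full 3D representation.

```python
# A minimal sketch (not the authors' code) of a transformer-based
# reconstruction model that accepts an arbitrary number of input views.
# The patch size, embedding width, and the flat "shape code" output head
# are illustrative assumptions; the paper's FlexRM decodes a richer 3D
# representation.
import torch
import torch.nn as nn


class VariableViewReconstructor(nn.Module):
    def __init__(self, image_size=256, patch=16, dim=512, depth=8, heads=8):
        super().__init__()
        self.tokens_per_view = (image_size // patch) ** 2
        # Patchify each RGB view into a grid of embeddings.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.tokens_per_view, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, dim)  # placeholder 3D-latent head

    def forward(self, views):
        # views: (batch, n_views, 3, H, W); n_views can vary per call,
        # since attention operates over the concatenated token sequence.
        b, n, c, h, w = views.shape
        tokens = self.patchify(views.flatten(0, 1))   # (b*n, dim, h', w')
        tokens = tokens.flatten(2).transpose(1, 2)    # (b*n, T, dim)
        tokens = tokens + self.pos_embed
        tokens = tokens.reshape(b, n * self.tokens_per_view, -1)
        latent = self.encoder(tokens)   # joint attention across all views
        return self.head(latent.mean(dim=1))          # pooled 3D shape code


# Usage: the same weights handle 4 views or 9 views.
model = VariableViewReconstructor()
for n_views in (4, 9):
    out = model(torch.randn(1, n_views, 3, 256, 256))
    print(n_views, out.shape)  # torch.Size([1, 512]) in both cases
```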
Lay Summary: Creating 3D assets from text descriptions or a single picture is a challenging task. Current AI methods often struggle because they rely on only a few generated 2D multi-view images to build the final 3D asset. If these generated images are low-quality, inconsistent, or don't show enough angles, the final 3D model can look unrealistic or incomplete.
Our system, Flex3D, tackles this problem with a two-stage approach. First, it generates a large and diverse pool of candidate 2D multi-view images. Then it curates this pool, selecting an optimal subset of high-quality, consistent views (a toy version of this curation step is sketched after the summary). These curated views are fed into our novel Flexible Reconstruction Model (FlexRM), which processes an arbitrary number of input views to reconstruct a detailed 3D asset.
This method allows Flex3D to generate significantly higher-quality 3D assets. Our results show it outperforms current methods, with users preferring its results over 92% of the time. This research makes it easier to create realistic 3D content for applications like video games and virtual reality, making 3D creation tools more powerful and accessible.
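As an illustration of the first stage's curation idea, the sketch below scores a pool of candidate views and keeps only the top-scoring subset. The `sharpness` and `consistency` heuristics here are hypothetical stand-ins for whatever quality and view-consistency measures a production pipeline would use; they are not the selection criteria used in Flex3D.

```python
# A minimal sketch (not the paper's exact pipeline) of the curation idea:
# generate many candidate views, score each one, and keep only a
# high-quality, mutually consistent subset. Both scoring functions below
# are illustrative stand-ins.
import numpy as np


def sharpness(gray: np.ndarray) -> float:
    # Variance of a discrete Laplacian as a crude blur/quality proxy.
    lap = (
        -4 * gray
        + np.roll(gray, 1, 0) + np.roll(gray, -1, 0)
        + np.roll(gray, 1, 1) + np.roll(gray, -1, 1)
    )
    return float(lap.var())


def consistency(view: np.ndarray, reference: np.ndarray) -> float:
    # Negative pixel distance to a reference (e.g., the input image);
    # a real system would compare in a learned feature space.
    return -float(np.mean((view - reference) ** 2))


def curate(candidates, reference, k=8, w=1e-3):
    # w balances the two scores; its value here is arbitrary.
    scores = [
        w * sharpness(v.mean(axis=-1)) + consistency(v, reference)
        for v in candidates
    ]
    keep = np.argsort(scores)[-k:]  # indices of the k best views
    return [candidates[i] for i in keep]


# Usage: 20 increasingly noisy candidates, keep the 8 scoring highest.
rng = np.random.default_rng(0)
ref = rng.random((64, 64, 3))
pool = [ref + 0.1 * i * rng.standard_normal(ref.shape) for i in range(20)]
print(len(curate(pool, ref)))  # 8
```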
Primary Area: Applications->Computer Vision
Keywords: 3D Generation, 3D Reconstruction, Large 3D Models
Submission Number: 6791