A3D: Does Diffusion Dream about 3D Alignment?

Anonymous ICLR 2025 submission

Please see the updated version of the supplementary anonymous webpage at
https://qrd9ph4uipym.github.io

Abstract

We tackle the problem of text-driven 3D generation from a geometry alignment perspective. Given a set of text prompts, we aim to generate a collection of objects whose semantically corresponding parts are aligned across the collection. Recent methods based on Score Distillation have succeeded in distilling the knowledge of 2D diffusion models into high-quality 3D object representations. These methods handle multiple text queries separately, so the resulting objects vary widely in pose and structure. However, in some applications, such as 3D asset design, it may be desirable to obtain a set of objects aligned with each other. To achieve alignment of the corresponding parts of the generated objects, we propose to embed these objects into a common latent space and optimize the continuous transitions between them. We enforce two properties of these transitions: smoothness of the transition and plausibility of the intermediate objects along it. We demonstrate that both of these properties are essential for good alignment. We present several practical scenarios that benefit from alignment between objects, including 3D editing and object hybridization, and experimentally demonstrate the effectiveness of our method.
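For intuition, the transition objective described above can be written schematically as follows. This is our own illustrative notation (the symbols z_1, z_2, g, and λ, and the exact form of both terms, are assumptions), not necessarily the formulation used in the paper:

```latex
% Schematic transition objective between two latent codes z_1 and z_2.
% g(z) is the 3D object decoded from latent z; lambda balances the terms.
% All notation here is illustrative, not taken verbatim from the paper.
\mathcal{L}(z_1, z_2)
  = \underbrace{\mathbb{E}_{t \sim \mathcal{U}[0,1]}
      \Big[\mathcal{L}_{\mathrm{SDS}}\big(g(z(t))\big)\Big]}_{\text{plausibility of intermediate objects}}
  \;+\; \lambda\,
    \underbrace{\mathbb{E}_{t}
      \Big[\big\|\tfrac{\partial}{\partial t}\, g\big(z(t)\big)\big\|^{2}\Big]}_{\text{smoothness of the transition}},
\qquad z(t) = (1-t)\,z_1 + t\,z_2 .
```

Intuitively, penalizing abrupt changes along the path pushes corresponding parts of the two endpoint objects to occupy the same regions of space, while the score-distillation term keeps every intermediate object a plausible shape.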

Our method A3D enables conditioning the text-to-3D generation process on a set of text prompts to jointly generate a set of 3D objects with a shared structure (top). This allows a user to create "hybrids" that combine parts from multiple aligned objects (middle), or to perform a text-driven, structure-preserving transformation of an input 3D model (bottom).

Motivation

Collections of objects generated with existing text-to-3D methods lack structural consistency (top). Shapes obtained with existing text-driven 3D editing methods suffer from poor text-to-asset alignment and low visual quality (middle). In contrast, our method enables the generation of structurally coherent, text-aligned assets with high visual quality (bottom).

Generation of multiple aligned 3D objects

We evaluate our method on the generation of sets of aligned objects using 15 pairs of prompts, each describing a pair of objects with similar morphology but different geometry and appearance, such as a car and a carriage. We cover various object categories, including animals, humanoids, plants, vehicles, furniture, and buildings.

Below, we show pairs of objects generated with existing methods and with our method, in different columns. Each pair of rows shows the results for one pair of prompts, written below it. For each object, we show a color rendering and, below it, a rendering of the geometry.

Our method generates both objects in a pair simultaneously, while the other methods first generate one of the objects and then generate the other from the first. We show two sets of results. Here, we show the first set, in which the left object is generated first, denoted p1→p2. The second set is shown further below.

Here, we show the second set of results, in which the right object is generated first, denoted p1←p2.

Hybridization: combining the aligned 3D objects

We show examples of hybrid objects that combine parts of the aligned objects produced by our method, and illustrate the process of obtaining these hybrids below. In some of these experiments, we intentionally choose hyperparameters different from those used for the generation of object pairs above, in order to increase the visual difference between the generated objects and make the hybridization easier to see.

To choose which part of each object to use, we assign several anchor points to each object (shown in the left column) and manually place these points in the common 3D space of the objects. We define the spatial distribution of the latent code (shown in the second column) via linear interpolation between the latent codes of the objects associated with the two closest anchors, as sketched below. The resulting objects are shown on the right.
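A minimal sketch of this anchor-based latent mixing is given below. The function name and the inverse-distance weighting between the two nearest anchors are our own illustrative choices; the paper's exact implementation may differ.

```python
# Sketch of anchor-based spatial latent interpolation (illustrative only).
import numpy as np

def spatial_latent(query: np.ndarray,
                   anchor_positions: np.ndarray,  # (K, 3) anchor points in object space
                   anchor_latents: np.ndarray     # (K, D) latent code per anchor
                   ) -> np.ndarray:
    """Return a latent code for a 3D point by linearly interpolating
    between the latent codes of the two closest anchors."""
    dists = np.linalg.norm(anchor_positions - query, axis=1)  # (K,)
    i, j = np.argsort(dists)[:2]                              # two nearest anchors
    # Weight anchor i more strongly the closer the query is to it.
    w = dists[j] / (dists[i] + dists[j] + 1e-8)
    return w * anchor_latents[i] + (1.0 - w) * anchor_latents[j]

# Example: two anchors, one per object; a point near the first anchor
# receives (close to) the first object's latent code.
positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
latents = np.stack([np.zeros(8), np.ones(8)])
print(spatial_latent(np.array([0.1, 0.0, 0.0]), positions, latents))
```

Restricting the interpolation to the two nearest anchors keeps each spatial region dominated by a single object's latent code, with a narrow blending band at the boundaries between parts.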

Pose-preserving transformation of 3D models

We evaluate the capability of our method to transform an initial 3D model while preserving its structure on 26 text prompts. For each prompt, we find a coarse initial model with the desired structure on the web, or use the SMPL parametric human body model in a desired pose.

Below, we show the objects generated with existing methods and with our method from the initial 3D model shown on the left. Each row shows the results obtained for the text prompt written below it. For each object, we show a color rendering and a rendering of the geometry.

Ablation

We compare our method against two groups of baselines for generating pairs of objects. We refer to the baselines in the first group as (A-C) and in the second group as (E, F), while (D) denotes our complete method. See Section 6 for a description of the baselines.

For each pair of generated objects, we show the results in three rows. The first two rows show the pairs of objects generated with our method (column (D)) and with the baselines (the other columns). The last row shows an overlay of the object silhouettes, demonstrating the alignment of their structural parts.
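The alignment visible in these overlays can also be quantified. A simple proxy (our own illustration, not a metric reported in the paper) is the intersection-over-union of the two silhouettes rendered from the same camera:

```python
# Silhouette IoU as a simple alignment proxy (illustrative only; not a
# metric from the paper). Inputs are boolean masks rendered from the
# same camera pose for the two objects.
import numpy as np

def silhouette_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection-over-union of two boolean silhouette masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union > 0 else 1.0

# Example with toy 4x4 masks: identical silhouettes give IoU = 1.0.
a = np.array([[0, 1, 1, 0]] * 4, dtype=bool)
print(silhouette_iou(a, a))  # 1.0
```

Perfectly aligned silhouettes give an IoU of 1, so a larger overlap in the last row corresponds to better structural alignment between the two objects.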