\section{Method} \label{sec:method}

Our generative framework comprises two models.
Section \ref{sec:encoder} details the implementation of the variational autoencoder (VAE) for shape-to-semantic signed distance field (SDF) mapping. The VAE includes an encoder that learns robust latent shape representations for synthetic shape sampling and a decoder that outputs a conditional SDF corresponding to a given latent. Section \ref{sec:generator} describes the shape latent diffusion model used to sample latents for synthetic shape generation. An overview of the framework is provided in Figure \ref{fig:2_method_overview}.

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{figures/2_method_overview.png}
    \caption{Schematic overview of the vessel generation framework: A semantic point cloud is encoded to a shape representation using self-attention and feed-forward (FF) blocks. The shape representation is decoded and the SDF values are predicted for the query points. The diffusion model generates shape representations that are decoded to synthetic shapes.}
    \label{fig:2_method_overview}
\end{figure}


\subsection{Semantic Shape Autoencoder} \label{sec:encoder}

We parameterize a shape as a point cloud of $N$ points $\mathbf{p}_i$ that lie on the shape surface, with corresponding one-hot encodings $\mathbf{h}_i$ of their semantic labels.
We use a VAE with a transformer architecture \cite{vaswani2017attention} to encode the point cloud into a set of $M$ latent vectors $\mathbf{z}_i$, where $M < N$.
Next, the latent set is decoded to an SDF whose zero iso-surface represents the encoded shape.

\paragraph{Signed Distance Fields}

SDFs implicitly represent the surface of a shape as a function $f(\mathbf{p}) = d$ that outputs the signed distance $d$ from a spatial coordinate $\mathbf{p}$ to the shape surface, where $d$ is negative for points inside the shape volume.
\footnote{We follow the convention where $d$ is negative inside the shape volume to ensure that the surface normal vectors point outward.}
The shape surface is defined by the zero level-set, i.e., all coordinates $\mathbf{p}_i$ where $f(\mathbf{p}_i) = 0$. 
SDFs satisfy the Eikonal equation, $||\nabla_{\mathbf{p}}f|| = 1$, and for coordinates on the surface, the gradient $\nabla_{\mathbf{p}}f$ corresponds to the surface normal vector $\mathbf{n}_{\mathbf{p}}$.
The Eikonal constraint acts as an inductive bias for implicitly regularizing SDF learning \cite{gropp2020implicit}. 
Leveraging this constraint makes it unnecessary to know $d$ for points off the surface, which eliminates the need for ground-truth signed distances.
As a result, SDFs can be learned in a self-supervised manner by enforcing $d = 0$ on surface points and ensuring that the gradients of both on-surface and off-surface points satisfy the Eikonal equation.
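
As a worked example for intuition (an illustration only, not a shape from our data), consider a sphere of radius $r$ centered at $\mathbf{c}$. Its SDF and gradient are
\begin{align*}
    f(\mathbf{p}) = ||\mathbf{p} - \mathbf{c}|| - r, \qquad \nabla_{\mathbf{p}}f = \frac{\mathbf{p} - \mathbf{c}}{||\mathbf{p} - \mathbf{c}||},
\end{align*}
so that $f < 0$ inside the sphere, $f = 0$ on its surface, and $||\nabla_{\mathbf{p}}f|| = 1$ for all $\mathbf{p} \neq \mathbf{c}$. On the surface, the gradient equals the outward unit normal $(\mathbf{p} - \mathbf{c})/r$.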

\paragraph{Shape Encoding}

The input to the encoder is a set of $N$ vectors $\mathbf{x}_i$, each the concatenation of a point $\mathbf{p}_i$ and its one-hot encoded semantic vessel label $\mathbf{h}_i$.
Following \cite{zhang20233dshape2vecset}, we use furthest-point sampling (FPS) to obtain a lower-resolution set of $M$ vectors $\mathbf{y}$ that are then used to gather downsampled feature vectors from the input using cross-attention:
\begin{align} \label{eq:cattn}
    \text{CrossAttn}(\mathbf{y}_i, \{\mathbf{x}_1,\dots,\mathbf{x}_N\}) = \sum_j a_{ij}\mathbf{v}(\mathbf{x}_j)\quad\text{and}\quad a_{ij} = \text{softmax}_j\left(\frac{\mathbf{q}(\mathbf{y}_i)^T\mathbf{k}(\mathbf{x}_j)}{\sqrt{D}}\right),
\end{align}
where $\mathbf{q}$, $\mathbf{k}$, and $\mathbf{v}$ are the query, key, and value functions of the attention mechanism, each mapping its input to $\mathbb{R}^D$, and the softmax normalizes over the index $j$.
\footnote{In practice, we first embed the set of concatenated coordinate and feature vectors $\mathbf{x}$ with a linear embedding before downsampling.}
Note that Equation \ref{eq:cattn} becomes self-attention when $\mathbf{x} = \mathbf{y}$.
The feature vectors are then processed by a series of self-attention blocks, followed by a linear map to a set of $C'$-dimensional means $\boldsymbol{\mu}$ and log-variances $\log\boldsymbol{\sigma}^2$, from which the $M$ shape latents $\mathbf{z}$ are sampled.
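
Assuming the standard VAE reparameterization with a diagonal Gaussian posterior (the usual choice, stated here as an assumption), each latent can be sampled as
\begin{align*}
    \mathbf{z}_i = \boldsymbol{\mu}_i + \boldsymbol{\sigma}_i \odot \boldsymbol{\epsilon}_i, \qquad \boldsymbol{\sigma}_i = \exp\left(\tfrac{1}{2}\log\boldsymbol{\sigma}_i^2\right), \qquad \boldsymbol{\epsilon}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
\end{align*}
for $i = 1, \dots, M$, where $\odot$ denotes element-wise multiplication. Predicting $\log\boldsymbol{\sigma}^2$ rather than $\boldsymbol{\sigma}$ keeps the standard deviation positive without an explicit constraint.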

\paragraph{Shape Decoding}

The decoder maps latent representations to $C$-dimensional feature vectors, which are then interpolated by query coordinate points using cross-attention. 
Each interpolated coordinate is subsequently mapped to a signed distance and semantic label via a two-layer linear mapping with GELU activation. We evaluate the signed distances $\tilde{d}$ and one-hot semantic labels $\mathbf{\tilde{h}}$ predicted by the decoder $g$ at surface query points $\mathbf{p}$ and off-surface query points $\mathbf{o}$ with the following objectives:
\begin{align}
    \mathcal{L}_{\text{Surface}} = |\tilde{d}|, \quad \mathcal{L}_{\text{Eikonal}} = (||\nabla_{\{\mathbf{o},\mathbf{p}\}}g|| - 1)^2, \quad\text{and}\quad \mathcal{L}_{\text{Normal}} = \langle\nabla_{\mathbf{p}}g, \mathbf{n}_{\mathbf{p}} \rangle,
\end{align}
and the predicted labels $\mathbf{\tilde{h}}$ as
\begin{align}
     \mathcal{L}_{\text{MSE}} = \text{MSE}(\mathbf{h}, \mathbf{\tilde{h}}).
\end{align}
Here, $|\cdot|$ denotes the $L_1$-norm, $||\cdot||$ denotes the $L_2$-norm, $\langle\cdot,\cdot\rangle$ indicates cosine similarity, and MSE is the mean squared error. We use MSE because the semantic SDF represents both signed distance and semantic class information as a unified vector field, making the task naturally suited for regression. Off-surface points $\mathbf{o}$ are generated by adding noise sampled from a Gaussian distribution with a standard deviation of 0.3 to the surface points. As the SDF defines the shape surface boundary, semantic labels are evaluated only for surface points, and no background label is introduced for off-surface points.
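
As a sketch of how these terms can be aggregated, assume isotropic noise and a mean over a batch of $K$ surface query points; $K$ and the mean reduction are illustrative assumptions, not details fixed by the description above:
\begin{align*}
    \mathbf{o}_i &= \mathbf{p}_i + \boldsymbol{\delta}_i, \qquad \boldsymbol{\delta}_i \sim \mathcal{N}(\mathbf{0}, 0.3^2\,\mathbf{I}), \\
    \mathcal{L}_{\text{Surface}} &= \frac{1}{K}\sum_{i=1}^{K} |g(\mathbf{p}_i)|, \\
    \mathcal{L}_{\text{Eikonal}} &= \frac{1}{2K}\sum_{i=1}^{K} \left[\left(||\nabla g(\mathbf{p}_i)|| - 1\right)^2 + \left(||\nabla g(\mathbf{o}_i)|| - 1\right)^2\right],
\end{align*}
where $g(\cdot)$ denotes only the signed-distance output of the decoder in these expressions.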

% \paragraph{Uncertainty Prior}

% For a local region of the shape surface, the signed distance function (SDF) increases linearly when moving along the surface normal.
% This property allows the decoder to model the uncertainty of its distance prediction as a one-dimensional zero-mean Gaussian along the surface normal.
% The uncertainty is captured through self-supervised optimization of the negative log-likelihood:
% \begin{align}\label{eq:prior}
%     \mathcal{L}_{\text{NLL}} = \frac{1}{2}\left(\log{\sigma^2} - \sigma^{-2}|d|\right),
% \end{align}
% where the decoder predicts the both log-variance $\log\sigma^2$ and the distance $d$ with corresponding point-features.
% Intuitively, in regions on surface where the decoder often predicts incorrect distances, it outputs a higher standard deviation, minimizing Equation \ref{eq:prior}.

\paragraph{Training Objective}

The VAE is trained by minimizing the following complete objective:
\begin{align}\label{eq:train_objective}
    \mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{Surface}} + \lambda_1\mathcal{L}_{\text{Eikonal}} + \lambda_2\mathcal{L}_{\text{Normal}} + \lambda_3\mathcal{L}_{\text{KL}} + \lambda_4\mathcal{L}_{\text{MSE}},
\end{align}
where $\mathcal{L}_{\text{KL}}$ is the KL-regularization of the latents $\mathbf{z}_i$ and the $\lambda_i$ weight the contributions of the individual loss terms. The $\mathcal{L}_{\text{MSE}}$ term can be omitted if semantic labels are not available or not required.
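
Assuming a diagonal Gaussian posterior and a standard normal prior (the common VAE setup, taken here as an assumption), the KL term has the closed form
\begin{align*}
    \mathcal{L}_{\text{KL}} = \frac{1}{2}\sum_{i=1}^{M}\sum_{c=1}^{C'}\left(\mu_{ic}^2 + \sigma_{ic}^2 - \log\sigma_{ic}^2 - 1\right),
\end{align*}
where $\mu_{ic}$ and $\sigma_{ic}$ are the $c$-th components of $\boldsymbol{\mu}_i$ and $\boldsymbol{\sigma}_i$. The term vanishes exactly when $\boldsymbol{\mu}_i = \mathbf{0}$ and $\boldsymbol{\sigma}_i = \mathbf{1}$ for all $i$.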

\subsection{Shape Latent Diffusion} \label{sec:generator}

Generating synthetic shapes by sampling latents $\mathbf{z}$ directly from the Gaussian prior often results in poor shape reconstructions because VAEs are prone to the prior-hole problem in large and complex latent spaces \cite{vahdat2021score}. 
This means that certain regions of the latent space do not hold any meaningful information.
The prior-hole problem can be addressed by using an auxiliary sampling model $\phi$ that learns to sample latents exclusively from regions that yield high-quality reconstructions.
To achieve this, we use latent diffusion \cite{rombach2022high} combined with a transformer architecture and optimize the mean squared error between the denoised and the clean latent sets:
\begin{align}
    \mathcal{L}_\text{Denoise} = \text{MSE}(\phi(\mathbf{z} + \epsilon, t), \mathbf{z}),
\end{align}
where $\epsilon \sim \mathcal{N}(0, t)$ is noise sampled at a given noise level $t$. Details of the preconditioned diffusion formulation used in our approach can be found in \cite{karras2022elucidating}.
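
For reference, the preconditioning of \cite{karras2022elucidating} wraps a raw network $F_\phi$ as shown below. It is written here with $t$ in place of their $\sigma$, under the assumption that $t$ denotes the noise standard deviation. The symbols $F_\phi$ and $\sigma_{\text{data}}$ are taken from that work, and whether exactly these scalings are used here is likewise an assumption:
\begin{align*}
    \phi(\mathbf{z}_t, t) &= c_{\text{skip}}(t)\,\mathbf{z}_t + c_{\text{out}}(t)\,F_\phi\left(c_{\text{in}}(t)\,\mathbf{z}_t,\; c_{\text{noise}}(t)\right), \qquad \mathbf{z}_t = \mathbf{z} + \epsilon, \\
    c_{\text{skip}}(t) &= \frac{\sigma_{\text{data}}^2}{t^2 + \sigma_{\text{data}}^2}, \quad c_{\text{out}}(t) = \frac{t\,\sigma_{\text{data}}}{\sqrt{t^2 + \sigma_{\text{data}}^2}}, \quad c_{\text{in}}(t) = \frac{1}{\sqrt{t^2 + \sigma_{\text{data}}^2}}, \quad c_{\text{noise}}(t) = \tfrac{1}{4}\ln t,
\end{align*}
where $\sigma_{\text{data}}$ is the standard deviation of the clean latents. These scalings keep the network input and the effective training target at roughly unit variance across noise levels.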
