\appendix 

\section{Code availability}
The code is available at the following link:\\ \url{https://github.com/Chavelanda/skeletal_fm}. 

\section{LDM architecture and hyperparameters}
\label{sec:arch_hyper}
This section provides a detailed description of the architecture used for generating synthetic volumes introduced in \sectionref{sec:arch}, together with the hyperparameters used for training and inference. Both the autoencoder and the denoiser were trained on 4 NVIDIA Ampere A100 GPUs with $64$\,GiB of GPU RAM for approximately 5 days.

\subsection{VQ-VAE}
The architecture used to learn a compressed representation of the volumes is a VQ-VAE~\cite{esser_taming_2021}, adapted from the version of \citet{khader_denoising_2023} so that the loss is computed at the highest available resolution. Given a dataset $\mathcal{D}$ containing volumes $v^* \in \mathbb{R}^{1 \times D \times H \times W}$, the encoder of the VQ-VAE is fed with pre-processed, downsampled volumes $v \in \mathbb{R}^{1 \times d \times h \times w}$ to compute dense latent representations $z \in \mathbb{R}^{k \times \frac{d}{s} \times \frac{h}{s} \times \frac{w}{s}}$, where $d$, $h$, and $w$ are the depth, height, and width of the pre-processed volume, $k$ is the number of channels in the latent, and $s$ is the encoder compression factor. In the quantization step, each latent feature vector $z_i \in \mathbb{R}^k$ is replaced by the closest code $q_i = Q(z_i)$ contained in the codebook $Q$. The decoder receives the quantized latent $q$ and reconstructs the volume $\hat{v} \in \mathbb{R}^{1 \times d \times h \times w}$, which is then upsampled with trilinear interpolation to obtain $\hat{v}^* \in \mathbb{R}^{1 \times D \times H \times W}$. The autoencoder is trained with two losses: a reconstruction loss $L_{recon} = \| v^* - \hat{v}^* \|_1$ and a commitment loss $L_{commit} = \frac{1}{I} \sum_{i=1}^{I} \|z_i - q_i\|^2_2$, where $I = \frac{d}{s} \cdot \frac{h}{s} \cdot \frac{w}{s}$ is the number of latent feature vectors.
The encoder is a sequence of blocks that first downsample the input via a 3D convolution and then process it through a residual block. The decoder is similar, with each block having a strided 3D convolution to upsample the input followed by two residual blocks. \tableref{tab:hyper_LDM} shows the hyperparameters for training the model.
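
To make the tensor shapes and the quantization step concrete, the following sketch instantiates them with the values of \tableref{tab:hyper_LDM} ($s=4$, $k=8$, and $16384$ codes). The pre-processed size $d=h=w=128$ is a hypothetical choice used only for illustration, and the nearest-neighbour rule under the Euclidean distance is the standard VQ-VAE formulation, which is assumed here; $e_j$ denotes the $j$-th code of the codebook.
\[
  z \in \mathbb{R}^{8 \times 32 \times 32 \times 32}, \qquad
  I = 32 \cdot 32 \cdot 32 = 32768, \qquad
  q_i = \arg\min_{e_j \in Q} \| z_i - e_j \|_2 , \quad 1 \le j \le 16384 .
\]
Under these assumptions, the encoder maps $128^3 = 2097152$ voxels to $32768$ code indices, i.e. $s^3 = 64$ times fewer spatial positions.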

\subsection{3D~U-Net}
The 3D~U-Net~\cite{ho2022videodiffusionmodels} is the network used to denoise the latents. It follows the traditional encoder-bottleneck-decoder structure, and is conditioned on both the noising timestep $t$ and the class $c$, i.e. the Quality Score. The timestep $t \in [0, T[$ is embedded using a sinusoidal positional embedding followed by two linear layers with a GELU activation in between. The class $c$ is embedded using a learned embedding layer to match the dimensionality of the timestep embedding. The embeddings of $t$ and $c$ are concatenated together to form the conditioning vector. 
To adapt the U-Net architecture to volumetric data, each 2D convolution is substituted with a 3D convolution using $3 \times 3 \times 3$ kernels. Conditioning is incorporated in the convolutional blocks by scaling and shifting the intermediate activation with the conditioning vector. Each convolutional block is followed by two attention layers: first, a 2D spatial attention over $H$ and $W$, where $D$ is treated as a batch axis; second, a 1D attention across the depth dimension $D$, with all other dimensions treated as batch axes. The details on the hyperparameters used for training the model are shown in \tableref{tab:hyper_LDM}.
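
The conditioning and the factorized attention described above can be sketched as follows. The sinusoidal form is the standard Transformer embedding, and the $(1+\gamma)$ scale-and-shift parameterization, the projection $W$, and the symbols $m$, $a$, $B$, and $C$ are assumptions introduced for illustration; the released code may differ in these details. For an embedding of even dimension $m$ and $j = 0, \dots, m/2 - 1$,
\[
  \mathrm{PE}(t)_{2j} = \sin\!\left( t / 10000^{2j/m} \right), \qquad
  \mathrm{PE}(t)_{2j+1} = \cos\!\left( t / 10000^{2j/m} \right).
\]
Writing $e = [e_t ; e_c]$ for the concatenated conditioning vector and $a$ for an intermediate activation of a convolutional block,
\[
  (\gamma, \beta) = W e, \qquad a' = a \odot (1 + \gamma) + \beta ,
\]
where $\gamma$ and $\beta$ have one entry per channel and are broadcast over the spatial axes. For a feature map of shape $B \times C \times D \times H \times W$, the spatial attention operates on $B \cdot D$ sequences of $H \cdot W$ tokens, while the depth attention operates on $B \cdot H \cdot W$ sequences of $D$ tokens.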

\begin{table}[!ht]
  \centering
  \floatconts
  {tab:hyper_LDM}
  {\caption{\textbf{Hyperparameters for training the VQ-VAE and the 3D~U-Net.}}}
  {
  \begin{tabular}{|l|c|}
  \hline
  \textbf{VQ-VAE} & \\
    \hline
    Encoder compression factor $s$ & 4 \\
    \hline
    Codebook size & 16384 \\
    \hline
    Codebook dimensionality $k$  &  8 \\
    \hline
    Batch size &  1\\
    \hline
    Learning rate & 1e-4 \\
    \hline
    No. training iterations & 150\,000 \\
    \hline
    \textbf{3D U-Net} & \\
    \hline
    Batch size &  1\\
    \hline
    Learning rate &  5e-5\\
    \hline
    No. training iterations & 200\,000 \\
    \hline
    Timesteps $T$ & 300 \\
    \hline
    
  \end{tabular}}
\end{table}

\section{Autoencoder ablation studies}
\label{sec:arch_ablation}
In this section, we validate different aspects of the autoencoder: the learning objective, the architecture and the effect of the input volume resolution.

\subsection{Learning objective}
\citet{khader_denoising_2023}, the work on which our architecture is based, employed a VQ-GAN to build the latent space, i.e. a modified VQ-VAE trained with both the reconstruction loss and an adversarial component~\cite{esser_taming_2021}. The original work also included a perceptual LPIPS loss~\cite{8578166} computed on random slices of the volumes. We ran experiments with both the VQ-VAE and VQ-GAN formulations to evaluate which one provides better reconstructions of craniofacial skeletal data, and we conducted an ablation on the perceptual loss component to assess its utility. The VQ-VAE trained faster and reached lower losses than the VQ-GAN, and the results in \tableref{tab:ablation-ae} show that the reconstruction is best when using the simpler loss described in \sectionref{sec:arch}, without the adversarial component and without the perceptual loss.
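
As a quick reading of the left part of \tableref{tab:ablation-ae}, the ratios between the reported reconstruction errors are
\[
  \frac{0.02323}{0.00227} \approx 10.2, \qquad \frac{0.00293}{0.00227} \approx 1.29 ,
\]
i.e. the VQ-GAN error is about ten times that of the plain VQ-VAE, and adding the LPIPS term increases the error by roughly $29\%$.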


\subsection{Architecture and input resolution}
In the 3D data generation pipeline, the autoencoder reconstruction quality acts as an upper bound for the quality of generated data. The quality is also constrained by the resolution of the input training volumes, which is often limited by memory constraints. We therefore investigated the difference in reconstruction quality with and without the final upsampling layer, and compared models trained with different input volume resolutions. To evaluate the performance of the autoencoder we use the reconstruction error, which is always computed with respect to the highest-resolution input volume. Lower-resolution volumes are upsampled during post-processing to make the computation possible. The settings we tested are the following:
\begin{itemize}
  \item \textbf{\textit{s=0.422}, up}: the downsampling factor is $0.422$ and there is the final upsampling layer, i.e. the loss is computed at the highest resolution.
  \item \textbf{\textit{s=0.53}, up}: the downsampling factor is $0.53$ and there is the final upsampling layer.
  \item \textbf{\textit{s=0.578}, no up}: the downsampling factor is $0.578$ and there is no final upsampling layer, i.e. the loss is computed at the downsampled resolution.
  \item \textbf{\textit{s=0.578}, up}: the downsampling factor is $0.578$ and there is the final upsampling layer.
  \item \textbf{\textit{s=0.82}, up}: the downsampling factor is $0.82$ and there is the final upsampling layer.
\end{itemize}

By comparing the results of the experiments with the same downsampling factor ($s=0.578$) but different architectures, we observe that upsampling the volume before loss computation is beneficial for the reconstruction quality. As expected, the input volume resolution correlates with reconstruction quality, but the quality gain diminishes as the resolution increases. The downsampling factor we used in our main experiments, $s=0.82$, is the one that yields the smallest loss of resolution given the memory constraints.
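
These observations can be quantified directly from the right part of \tableref{tab:ablation-ae}. At $s=0.578$, adding the final upsampling layer gives a relative error reduction of
\[
  \frac{0.0039 - 0.0025}{0.0039} \approx 36\% .
\]
With the upsampling layer in place, moving from $s=0.422$ to $s=0.578$ and then from $s=0.578$ to $s=0.82$ gives
\[
  \frac{0.0037 - 0.0025}{0.0037} \approx 32\% , \qquad
  \frac{0.0025 - 0.0023}{0.0025} = 8\% ,
\]
which illustrates the diminishing gain at higher resolutions.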


\begin{table}[!ht]
    \centering
    \floatconts
    {tab:ablation-ae}
    {\caption{\textbf{Autoencoder ablation results.} Left: reconstruction error for different autoencoders. Right: reconstruction error for different input volume resolutions and different autoencoder architectures.}}
    {\vspace{-10pt}%\small
    \begin{minipage}[t]{0.48\textwidth}
        \centering
        \resizebox{\textwidth}{!}{
        \begin{tabular}{@{}lc@{}}
        \toprule
         Autoencoder & Reconstruction error $\downarrow$ \\
        \midrule
        VQ-GAN & 0.02323 \\
        VQ-VAE + LPIPS & 0.00293 \\
        VQ-VAE & \textbf{0.00227} \\
        
        \bottomrule
      \end{tabular}}
      % \caption{Reconstruction error for different autoencoders}
      % \label{tab:ablation_ae_recon_error}
    \end{minipage}
    \hfill    
    \begin{minipage}[t]{0.48\textwidth}
    \centering
   \resizebox{\textwidth}{!}{
    \begin{tabular}{lc}
        \toprule
        Res, Model & Reconstruction error $\downarrow$ \\ 
        \midrule
        s=0.422, up & 0.0037 \\ 
        s=0.53, up & 0.0033 \\ 
        s=0.578, no up & 0.0039 \\ 
        s=0.578, up & 0.0025 \\ 
        s=0.82, up & \textbf{0.0023} \\ 
        \bottomrule
    \end{tabular}}
    % \caption{Reconstruction error for different input volume resolutions and different autoencoder architectures.}
    % \label{tab:resolution_ablation}
    \end{minipage}}
\end{table}

\section{Generation results supplementary}
In this section, we present the results of an analysis conducted on synthetic datasets generated by the OTFM-based and DDPM-based models, aimed at quantifying the occurrence of failed generations and conditioning errors in both cases. As highlighted by \tableref{tab:quant-failure}, OTFM demonstrates greater robustness in generating novel samples, resulting in fewer completely out-of-distribution samples and more accurate conditioning.

\begin{table}
\centering
\floatconts
{tab:quant-failure}
{\caption{\textbf{Failure analysis experiment.} Percentage of failed generations and conditioning errors for DDPM-based and OTFM-based models.}}
{\vspace{-10pt}%\small
\begin{tabular}{lcc}
\toprule
       & Failed generations       & Conditioning errors          \\
       \midrule
DDPM & 12\%          & 32\%      \\
OTFM & \textbf{4\%} & \textbf{6\%}    \\
\bottomrule
\end{tabular}}

\end{table}

\section{Downstream tasks supplementary}

This section provides further explanations and experiments on the downstream tasks. We describe the aligner neural network and provide the hyperparameters used to train it. We also provide the training details of the shape completion network, together with the results of the dataset balancing experiment.

\subsection{Aligner architecture}
\label{sec:aligner}
In this section we describe the architecture of the aligner network. The network consists of two main components: the first estimates the rotation of the volume; the second refines this estimate and regresses a translation vector. As explained in \sectionref{sec:downstream}, the model takes as input a roto-translated point cloud. The cloud is encoded into a high-dimensional embedding using a modified PointNet++ (MSG), one of the modules of PointNet++~\cite{qi2017pointnet++}. PointNet++ (MSG) is modified by adding a skip connection from the output of the first Set Abstraction layer to the output of the encoder. The resulting embedding vector is processed by a rotation head, which consists of two MLP blocks followed by a linear layer that regresses a flattened rotation matrix. Each MLP block reduces the vector dimensionality by a factor of four and consists of a linear layer followed by batch normalization and a ReLU activation function. The rotation matrix regressed by the rotation head is used to obtain an approximate rotated skull.

The second component is similar to the first one, and processes the approximate skull to obtain a finer rotation and a translation vector. The rotated skull is encoded with the modified PointNet++ (MSG), and the embedding is fed to the rotation head to recover another, finer, rotation matrix. The embedded representation of the point cloud is also fed to a translation head to retrieve a translation vector. The translation head is similar to the rotation one: it consists of two MLP blocks followed by a linear layer, but in this case the final linear layer regresses a translation vector. The final rotation matrix and translation vector are used to compute the predicted skull. The L1 loss between the predicted skull and the ground-truth skull is used to train the network.
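
The two-stage procedure can be summarized with the following sketch. The symbols are introduced here for illustration only: $P \in \mathbb{R}^{N \times 3}$ is the input point cloud with points as rows, $E_1$ and $E_2$ are the encoders of the two components, $f_{rot}^{(1)}$ and $f_{rot}^{(2)}$ the rotation heads, $f_{tr}$ the translation head, and $\mathbf{1} \in \mathbb{R}^{N}$ the all-ones vector. Whether the two components share weights, the order in which rotation and translation are applied, and the normalization of the loss are assumptions rather than details stated above.
\[
  R_1 = f_{rot}^{(1)}(E_1(P)), \qquad P_1 = P R_1^\top ,
\]
\[
  R_2 = f_{rot}^{(2)}(E_2(P_1)), \qquad \tau = f_{tr}(E_2(P_1)), \qquad \hat{P} = P_1 R_2^\top + \mathbf{1}\tau^\top ,
\]
\[
  L_{align} = \frac{1}{N} \sum_{i=1}^{N} \| \hat{p}_i - p_i \|_1 ,
\]
where $\hat{p}_i$ and $p_i$ are corresponding points of the predicted and ground-truth skulls. With the embedding dimension of \tableref{tab:downstream-params}, and assuming that each MLP block divides the dimensionality by four and that the rotation matrix is $3 \times 3$, each rotation head would map $1024 \rightarrow 256 \rightarrow 64 \rightarrow 9$ and the translation head $1024 \rightarrow 256 \rightarrow 64 \rightarrow 3$.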

\subsection{Network hyperparameters}
Details on the hyperparameters of both the aligner and the shape completion networks are provided in \tableref{tab:downstream-params}.

\begin{table}[!ht]
  \centering
  \floatconts
  {tab:downstream-params}
  {\caption{\textbf{Hyperparameters of the downstream task networks.}}}
  {%\small
  \begin{tabular}{|l|c|}
  \hline
  \textbf{Aligner} & \\
    \hline
    No. training iterations & 7000 \\
    \hline
    Loss & L1 \\
    \hline
    Num. input points &  8192 \\
    \hline
    Batch size &  12 \\
    \hline
    Learning rate & 0.001 \\
    \hline
    Embedding dim &  1024 \\ 
    \hline
    \textbf{Shape completion} & \\
    \hline
    No. training iterations & 72800 \\
    \hline
    Loss & Chamfer Distance \\
    \hline
    Num. input points & 4096 \\
    \hline
    Batch size & 4 \\
    \hline
    Learning rate & 0.0001 \\
    \hline
  \end{tabular}}
\end{table}

\subsection{Shape completion balancing experiment}
\label{sec:shape_completion}
The results of the experiment on balancing the training set in the shape completion downstream task (\sectionref{sec:downstream}) are not reported in the main paper because the baseline results, i.e. those obtained using only real data, did not show an impact of the Quality Score on the performance of the model. As \tableref{tab:shape-completion-balancing} shows, the average Chamfer Distance between the predicted and ground-truth point clouds does not correlate with the cardinality of the QS classes. On the contrary, the baseline model performs best on QS~6 data. For this task, it would therefore not make sense to use the generator to create a balanced dataset. Still, for completeness, we report the results of the balancing experiment.

\tableref{tab:shape-completion-balancing} shows that the distribution of the Quality Scores in the training set is not significant for model performance. The model trained with the balanced dataset and the model trained with the augmented stratified dataset have very similar overall performance. Moreover, the performance on each QS does not correlate with the cardinality of the class, confirming that in this task balancing the dataset is not useful. 
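
For reference, a common symmetric form of the Chamfer Distance between a predicted point cloud $\hat{P}$ and a ground-truth point cloud $P$ is given below. The exact variant used in the experiments (squared or unsquared distances, sum or mean) is not restated here, so this form should be read as an assumption.
\[
  \mathrm{CD}(\hat{P}, P) = \frac{1}{|\hat{P}|} \sum_{x \in \hat{P}} \min_{y \in P} \| x - y \|_2^2
  + \frac{1}{|P|} \sum_{y \in P} \min_{x \in \hat{P}} \| x - y \|_2^2 .
\]
In terms of the overall values in \tableref{tab:shape-completion-balancing}, the stratified and balanced augmentations improve on the real-only baseline by $(3.29 - 3.17)/3.29 \approx 3.6\%$ and $(3.29 - 3.14)/3.29 \approx 4.6\%$, respectively, and differ from each other by only $0.03$.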


\begin{table}
\centering
\floatconts
{tab:shape-completion-balancing}
{\caption{\textbf{Shape completion experiment on dataset balancing using OTFM.} Values are the average Chamfer Distance (lower is better). \textit{OTFM-s} is the real dataset augmented in a stratified fashion. \textit{OTFM-b} is the real dataset augmented to balance the class cardinalities.}}
{\vspace{-10pt}%\small
\begin{tabular}{lcccccc}
\toprule
       & overall       & QS 2          & QS 3          & QS 4          & QS 5          & QS 6          \\
       \midrule
100\% real   & 3.29 & 3.26          & 3.09          & 3.89          & 3.09          & 3.02 \\
%OTFM   & 3.36          & 3.41          & 3.14          & \textbf{3.74} & 3.28          & 3.19          \\
OTFM-s & 3.17          & 3.16         & 3.02        & \textbf{3.52} & 3.12       & \textbf{2.96}       \\
OTFM-b & \textbf{3.14} & \textbf{3.06}& \textbf{2.99}    & 3.61     & \textbf{2.99}     & 3.00        \\
\bottomrule
\end{tabular}}

\end{table}