\section{Method}
\label{sec:method}

\subsection{Flow Matching}
\label{sec:fm}
We start with a brief review of Flow Matching (FM), a generative modelling technique introduced by \citet{lipman2023flowmatchinggenerativemodeling}.
FM aims to learn a mapping between two distributions: a complex distribution $P_0$, that of the real data, and a simple distribution $P_1$, usually modelled as a standard normal distribution $\mathcal{N}(0,1)$.
Similarly to DDPMs, the core components of FM are the forward and backward processes. The forward process describes how a data sample is noised: given a sample $x_0$ from $P_0$ and a timestep $t$, it returns a noised version $z_t$ of the sample. It can be seen as a flow of the sample between $P_0$ and $P_1$:
\begin{equation}
  z_t=a_tx_0 + b_t\epsilon  \text{\quad where \quad} \epsilon \sim \mathcal{N}(0,1)  
  \label{eq:fp}
\end{equation}
Here, $t$ is the timestep describing how far along the flow we are, and it is uniformly distributed between $0$ and $1$, while $a_t$ and $b_t$ are two functions describing the noising process. When $t=0$, $a_t=1$ and $b_t=0$, therefore $z_t=x_0$, i.e. we have not moved along the flow. Conversely, when $t=1$, $a_t=0$ and $b_t=1$, therefore $z_t=\epsilon$, i.e. the flow of any data sample ends in a realization of standard Gaussian noise.
The flow is modelled by a vector field $u_t$, which returns the direction of the flow at a specific point. It is defined as the time derivative of the flow, $u_t=\frac{dz_t}{dt}$.
By learning the vector field $u_t$ with a neural network $v_\Theta$ parametrized by $\Theta$, we learn the mapping between the two distributions $P_0$ and $P_1$ and, by traversing the flow, we are able to generate new data from the distribution $P_0$. To learn the vector field, \citet{lipman2023flowmatchinggenerativemodeling} proposed a tractable learning objective called Conditional Flow Matching (CFM), which allows us to supervise the learning using pairs of samples $x_0 \sim P_0$, $\epsilon \sim P_1$, and a randomly sampled timestep $t \in [0,1]$:
\begin{equation}
  \mathcal{L}_{CFM}= \mathbb{E}_{t,\,p_t(z|\epsilon)\,p(\epsilon)} \| v_\Theta(z, t) - u_t(z|\epsilon) \|_2^2 
  \label{eq:cfm}
\end{equation}
Different ways of defining the forward process, i.e. different choices of $a_t$ and $b_t$, define different flows between the two distributions $P_0$ and $P_1$. An interesting case is the one that recovers the DDPM learning objective~\cite{lipman2023flowmatchinggenerativemodeling, pmlr-v235-esser24a}, which means that FM subsumes DDPMs. A more intuitive way to define the mapping between the distributions is to use Optimal Transport, i.e. a straight path between them:
\begin{equation}
  a_t=1-t \text{,\quad} b_t=t 
  \label{eq:ab}
\end{equation}
This means that in Flow Matching with Optimal Transport (OTFM)~\cite{lipman2023flowmatchinggenerativemodeling} the forward process for a single sample, and the ground truth vector field that the network aims to learn, can be described as:
\begin{equation}
  z_t = (1-t)x_0 + t\epsilon\text{,\quad \quad} u_t = \epsilon - x_0
  \label{eq:forward_ot}
\end{equation}
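As a quick check, the ground-truth vector field in \equationref{eq:forward_ot} follows from differentiating the forward process of \equationref{eq:fp} with the coefficients of \equationref{eq:ab}. The primes below denote derivatives with respect to $t$ and are notation introduced here only for this derivation:
\begin{equation}
  u_t = \frac{dz_t}{dt} = a'_t x_0 + b'_t \epsilon = -x_0 + \epsilon
\end{equation}
The target is therefore constant in $t$ for a fixed pair $(x_0, \epsilon)$, which is what makes the path between the two samples a straight line.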
In practice, at training time, for a given image we sample a timestep $t \in [0,1]$ and a noise sample $\epsilon \sim \mathcal{N}(0,1)$.
We compute the noised version of the image $z_t$ as per \equationref{eq:forward_ot} and pass $t$ and $z_t$ as input to a denoiser network, which predicts the vector field $v_\Theta(z_t, t)$. The loss is then the squared $L_2$ distance between the predicted vector field and the ground-truth one, as in \equationref{eq:cfm}.
Conversely, at inference, given a number of steps $S$, we start from standard Gaussian noise $z=\epsilon$ at $t=1$ and predict the vector field $v_\Theta(z, t)$ with the trained network. We then update the sample and the timestep by taking a small step against the predicted direction, $z_{new} = z - \frac{v_\Theta(z, t)}{S}$ and $t_{new} = t - \frac{1}{S}$. Repeating this operation $S$ times yields a synthetic sample from the data distribution $P_0$.
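
Written out under the OT path, the training and sampling procedures take the following form. This is a sketch obtained by substituting \equationref{eq:forward_ot} into \equationref{eq:cfm}; the name $\mathcal{L}_{OTFM}$, the step index $k$ and the shorthand $t_k$ are introduced here for illustration only.
\begin{equation}
  \mathcal{L}_{OTFM} = \mathbb{E}_{t,\,x_0,\,\epsilon} \| v_\Theta((1-t)x_0 + t\epsilon,\, t) - (\epsilon - x_0) \|_2^2
\end{equation}
\begin{equation}
  z_{t_{k+1}} = z_{t_k} - \frac{1}{S}\, v_\Theta(z_{t_k}, t_k) \text{,\quad} t_k = 1 - \frac{k}{S} \text{,\quad} k = 0, \dots, S-1
\end{equation}
As a sanity check, if the network predicted the ground-truth field of a single pair exactly, i.e. $v_\Theta = \epsilon - x_0$ at every step, the $S$ updates starting from $z_{t_0}=\epsilon$ would give $z_{t_S} = \epsilon - S \cdot \frac{\epsilon - x_0}{S} = x_0$, so the recursion recovers the data sample at $t_S = 0$.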



\subsection{Architecture, training and inference}
\label{sec:arch}
We use OTFM to train a generator that synthesizes novel 3D skeletal data. However,
generating volumes at a reasonable resolution directly in the data space is not feasible, as it would require prohibitive memory resources. Instead, we apply the generative process in a lower-dimensional latent space. As illustrated in \figureref{fig:architecture}, we design two main components to enable this, starting from the work of \citet{khader_denoising_2023}: an autoencoder, trained in the first stage;
% , which learns a compressed representation of the volumes by encoding and decoding them minimizing the reconstruction loss
and a denoiser, trained in the second stage to generate novel but realistic samples within the latent space. % Perhaps discuss here the reason for voxelization

\begin{figure}[ht]
  \centering
  %\fbox{\rule{0pt}{2in} \rule{0.9\linewidth}{0pt}}
   \includegraphics[width=.90\linewidth]{assets/architettura_completa - MIDL.drawio.png}

   \caption{Architecture of the proposed generative model. The model is trained in two stages: first, the VQ-VAE is trained, with the loss computed at the highest available resolution; then, the denoiser is trained with the OTFM or DDPM learning objective, depending on the experiment.
   % Architecture of the Latent Diffusion Model. 
   %The model is trained in two stages: the VQ-VAE training, with the loss being computed at the highest available resolution; the denoiser training, which is trained with OTFM or DDPM learning objective depending on the experiment.
   }
   \label{fig:architecture}
\end{figure}

Training the autoencoder on data at full resolution is often prohibitive in terms of memory consumption, which forces volumes to be downsampled during preprocessing and causes information loss. We therefore designed the autoencoder as a VQ-VAE~\cite{razavi_generating_2019} modified to minimize the information loss caused by the downsampling. The VQ-VAE, in blue in \figureref{fig:architecture}, is made of three main components: an encoder, a decoder, and a vector quantization layer. The latter consists of a codebook of $n$ vectors and replaces each element of the dense representation with the nearest code vector (in Euclidean distance), resulting in a discrete latent representation of the input data. To retain high-resolution information, as shown in \figureref{fig:architecture}, we added a deterministic trilinear upsampling layer as the last layer of the decoder. In this way, the reconstruction loss of the autoencoder is computed at the highest possible resolution with negligible memory overhead.
% MODIFIED FOR MIDL
The VQ-VAE is optimized with two loss components: a reconstruction loss $L_{rec}$ defined as the L1 distance between a volume $x$ and its reconstruction $\hat{x}$; and a commitment loss $L_{commit}$, the mean squared error between the encoder's output and the selected code vector. Each vector of the codebook is optimized by maintaining an exponential moving average of all the dense vectors that get mapped to it.
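For concreteness, the two loss terms and the codebook update can be written as follows. The notation is introduced here for illustration and is not used elsewhere in the paper: $E(x)_j$ is the $j$-th vector of the dense encoder output, $e_k$ the $k$-th code vector, $\mathrm{sg}[\cdot]$ the stop-gradient operator, $\beta$ a weighting coefficient, $\gamma$ the decay of the moving average, and $n_k$ the number of encoder vectors assigned to code $k$ in the current batch. The weighted sum of the two terms and the form of the moving-average update are the standard ones for VQ-VAEs and are assumed here; normalization constants are omitted.
\begin{equation}
  k^*(j) = \arg\min_k \| E(x)_j - e_k \|_2 \text{,\quad} L_{rec} = \| x - \hat{x} \|_1 \text{,\quad} L_{commit} = \sum_j \| E(x)_j - \mathrm{sg}[e_{k^*(j)}] \|_2^2
\end{equation}
\begin{equation}
  L = L_{rec} + \beta L_{commit}
\end{equation}
\begin{equation}
  N_k \leftarrow \gamma N_k + (1-\gamma)\, n_k \text{,\quad} m_k \leftarrow \gamma m_k + (1-\gamma) \sum_{j:\,k^*(j)=k} E(x)_j \text{,\quad} e_k = \frac{m_k}{N_k}
\end{equation}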

The denoiser, in red in \figureref{fig:architecture}, is implemented as a 3D~U-Net~\cite{ho2022videodiffusionmodels}.
% REMOVED FOR MIDL
%, which is a UNet \cite{ronneberger_u-net_2015} modified to support 3D input data by substituting 2D convolutions with 3D convolutions. 
% REMOVED FOR MIDL
% The model can be trained conditioned on a class label, which is embedded, concatenated to the timestep embedding, and fed as input to the cross-attention layers of the UNet3D. 
% MODIFIED FOR MIDL
The model is optimized with OTFM, but we also train a variant with the DDPM learning objective to conduct a comparative experimental evaluation (see \sectionref{sec:experiments}). The training pipeline is identical across all experiments, regardless of the objective.
The model receives as input the latents of the training volumes, normalized to lie approximately in the $[-1,1]$ range. Following \citet{khader_denoising_2023}, this is done via min-max normalization, using as bounds the minimum and maximum values found in the learned codebook.
Following the forward process, the noised latent $z_t$ is computed, with $t$ sampled from a discrete uniform distribution defined over $300$ timesteps, and is subsequently fed to the network to regress the vector field, as described in \sectionref{sec:fm}.
In addition, during training, the model is conditioned on a class label, which is embedded and concatenated with the timestep embedding to produce a conditioning vector. 
This conditioning vector is used to scale and shift the intermediate activations of the convolutional blocks of the U-Net.
In our setting, the generation is conditioned on the Quality Score, which is used as the class label and is a feature of the data described in \sectionref{sec:data}.
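The latent normalization and the conditioning mechanism described above can be summarized as follows. The symbols are introduced here for illustration only: $c_{min}$ and $c_{max}$ denote the minimum and maximum values found in the learned codebook, $\tilde{z}$ the normalized latent, $h$ an intermediate activation of a convolutional block, and $\alpha(c)$, $\delta(c)$ the scale and shift predicted from the conditioning vector $c$. This is a sketch of the general mechanism rather than an exact specification of the implementation (for instance, some implementations scale by $1+\alpha(c)$).
\begin{equation}
  \tilde{z} = 2\,\frac{z - c_{min}}{c_{max} - c_{min}} - 1 \text{,\quad} h' = \alpha(c) \odot h + \delta(c)
\end{equation}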
% The model is conditioned on the Quality Score, a feature of our data described in \sectionref{sec:data}.

% MODIFIED FOR MIDL
At inference time, new volumes are generated by reversing the forward process.
Starting from $z_t$ sampled from the distribution $P_1$, i.e. pure noise at $t=1$, we iteratively query the denoiser and use its output to compute $z_{t-\frac{1}{S}}$, until we reach $t=0$. We do this in $S=300$ steps. The generated latent is then quantized and decoded to yield a synthetic sample. The resulting volume is a voxel grid with the same size as the volumes of the original dataset, and it can be further processed to obtain a mesh via the marching cubes algorithm~\cite{marching_cubes_10.1145/37401.37422}.
A more detailed description of our architecture, together with the training hyperparameters, is provided in the supplementary material, \sectionref{sec:arch_hyper}, while the ablation studies on the architectural choices are presented in \sectionref{sec:arch_ablation} of the supplementary material.

\subsection{Dataset and preprocessing}
\label{sec:data}
The data used in this work is the union of two different datasets of anonymized head CT scans: CQ500, a publicly available dataset of $355$ scans introduced by \citet{chilamkurthy_development_2018}; and a private dataset of $591$ scans from Bologna’s Sant’Orsola hospital\footnote{The use of this dataset for research purposes was approved by the local institution. Protocol details will be provided upon acceptance.}.
% CAT REMOVED FOR MIDL
% CT scans are a medical imaging modality used to obtain internal images of the body. In the case of the head they are used to detect infarction, tumors, calcifications, hemorrhage and bone trauma. Depending on the intensity of the image, experts can distinguish between soft tissues and bones, with the latter identified as the brightest structures. Several clinical tasks need clinicians to operate solely on the bone structure, thus requiring a segmentation of the original volume. This is done via manual thresholding of the intensity, and can result in unwanted holes mainly caused by lack of contrast in the data, insufficient thickness of the bones, noise and artifacts.
An expert segmented the volumes to obtain meshes depicting the skeletal part of the head, and subsequently aligned them to a reference skull. Since the CTs were acquired for different purposes, most of the scans do not depict the entire skull. Therefore, the same expert labelled them with a Quality Score (QS), which describes the extent and completeness of each shape. As illustrated in \figureref{fig:qs_ex}, QS~1 data represent the least complete skulls; QS~2 data include a complete skull cap but only a very limited portion of the nasal area; QS~3 data also contain a complete skull cap, along with a larger portion of the nasal area; QS~4 data may lack the skull cap but must contain the complete nasal area; QS~5 data contain almost complete skulls; and QS~6 data contain the mandible but miss the upper part of the skull.
QS~1 meshes are not informative enough to be kept for training the models and have been removed from the dataset.
Moreover, other scans were manually labelled as unsuitable and have also been discarded.
The final dataset consists of $908$ scans, $341$ from the public dataset and $567$ from Sant’Orsola’s dataset, and is split in a stratified fashion into a train set ($726$) and a test set ($182$).
The distribution of Quality Scores across the splits is depicted in \figureref{fig:qs_dist}, which shows that the dataset is imbalanced, as QS~6 volumes are fewer than those of any other Quality Score.
We leverage conditioned generation to mitigate the imbalance and evaluate its effectiveness in clinical downstream tasks.
 

% Perhaps discuss here the reason for voxelization
Since the models require voxels as input data, the first preprocessing step is the voxelization of the meshes, performed with an isotropic spacing of $0.51$\,mm and resulting in cubic voxel grids of side $512$. To reduce memory consumption, the grids are subsequently cropped to remove regions that are empty across all training skulls, leading to grids of size $456\times352\times512$. As described in \sectionref{sec:arch} and illustrated in \figureref{fig:architecture}, this is the output size of the model, i.e. the size at which the autoencoder loss is computed thanks to the trilinear upsampling layer. On the other hand, as shown in the downsampling block of \figureref{fig:architecture}, the input volumes are downsampled with a scaling factor $s=0.82$ due to computational constraints, leading to input grids of size $374\times289\times420$. Finally, the volumes are min-max normalized to the range $[-1,1]$.
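
The grid sizes above can be verified with a short calculation, assuming that each downsampled dimension is rounded to the nearest integer (the rounding rule is an assumption made here; it is consistent with the reported sizes):
\begin{equation}
  456 \times 0.82 = 373.92 \approx 374 \text{,\quad} 352 \times 0.82 = 288.64 \approx 289 \text{,\quad} 512 \times 0.82 = 419.84 \approx 420
\end{equation}
The autoencoder input therefore contains $374 \times 289 \times 420 \approx 4.54 \times 10^{7}$ voxels, against $456 \times 352 \times 512 \approx 8.22 \times 10^{7}$ voxels at the output resolution, i.e. roughly $55\%$ of the voxels ($0.82^3 \approx 0.551$).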


\begin{figure}[ht]
\floatconts
  {fig:qs}%
  {\caption{(a): Examples of meshes for each Quality Score. (b): Distribution of the Quality Scores in the train and test splits.}}%
  {%
    \subfigure{%
      \label{fig:qs_ex}%
      \includegraphics[width=0.48\textwidth]{assets/qs_examples.drawio.png}%
    }\hfill
    \subfigure{%
      \label{fig:qs_dist}%
      \includegraphics[width=0.50\textwidth]{assets/data_dist_2.png}%
    }
  }
\end{figure}

