\section{Experiments and discussion}
\label{sec:experiments}

\subsection{Evaluation protocol}
The trained models are used to generate synthetic datasets that are evaluated both qualitatively and quantitatively to assess their quality. 
For the qualitative evaluation, we provide both central slices of synthetic voxel grids and the generated meshes. 
For the quantitative evaluation, multiple metrics are computed to assess different aspects of generation quality. In more detail, 
the Fréchet Inception Distance (FID)~\cite{heusel_gans_2017} summarizes the quality of the generated dataset in a single value by comparing the feature distributions of real and synthetic data. This metric has already been used in several works on 3D medical data generation~\cite{pinaya_brain_2022, wang20243dmeddiffusion3dmedical, yazdani_flow_2025, kim_10452780, friedrich_wdm_2024}. % describe in more detail?
Since the FID is a single scalar value, it does not allow us to disentangle realism from diversity in the generations. To do so, we compare real and synthetic data using precision and recall: the former measures the ability of the model to generate realistic volumes, while the latter indicates how much of the spectrum of real data is covered by the generated dataset. Several methods exist to compute precision and recall; in this work we use Improved Precision and Recall (IP, IR)~\cite{kynkäänniemi2019improvedprecisionrecallmetric}, as well as Probabilistic Precision and Recall (PP, PR)~\cite{Park_2023}. 
% REMOVED FOR MIDL
% We use IP and IR as they are the most popular implementations for precision and recall computation. We decided to complement them with PP and PR as \citet{Park_2023} showed that they are less sensible to outliers and provide a more reliable estimate of realism and diversity.

All the metrics described above (i.e., FID, precision, and recall) are computed in the feature space of a pretrained network. In particular, similarly to other works on medical data~\cite{pinaya_brain_2022, friedrich_wdm_2024, wang20243dmeddiffusion3dmedical}, we use the features extracted with Med3D~\cite{chen_med3d_2019}, a 3D convolutional network trained on aggregated datasets from different medical tasks. The features are extracted by applying Global Average Pooling to the final layer before classification, resulting in feature vectors of size $2048$.
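For reference, the FID takes the standard closed form of the Fréchet distance between two Gaussians fitted to the feature vectors; the symbols below are our own notation, introduced here only for illustration:
\[
  \mathrm{FID} = \| \mu_r - \mu_s \|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_s - 2 \left( \Sigma_r \Sigma_s \right)^{1/2} \right),
\]
where $(\mu_r, \Sigma_r)$ and $(\mu_s, \Sigma_s)$ denote the mean and covariance of the real and synthetic feature vectors, respectively.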
% REMOVED FOR MIDL
% When dealing with natural images this is often a pretrained Inception Network, but medical data present different features and thus require a different pretrained network. 

% optionally add cit to the other works that use med3d
Moreover, we use the Multi-Scale Structural Similarity Index Measure (MS-SSIM)~\cite{wang_multiscale_2003} to evaluate the diversity within a dataset by computing the similarity between $1000$ random pairs of volumes from the same set and averaging the results. The lower the measure, the more diverse the dataset. 
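As a sketch of this computation, and using notation that is ours rather than taken from the cited work, the diversity score of a dataset $\mathcal{D}$ can be written as
\[
  \bar{S}(\mathcal{D}) = \frac{1}{N} \sum_{k=1}^{N} S\left(x_{i_k}, x_{j_k}\right), \qquad N = 1000,
\]
where $S(\cdot,\cdot)$ denotes the MS-SSIM between two volumes and $(x_{i_k}, x_{j_k})$ is the $k$-th randomly drawn pair from $\mathcal{D}$. We assume here that the two volumes in each pair are distinct.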
% REMOVED FOR EXCESSIVE LENGTH
%Still, we do not aim at obtaining the lowest value possible, as having an MSSSIM close to zero would mean to have volumes completely different in structure and patterns between one another; we aim at obtaining similar results in the real and synthetic datasets, i.e. obtaining a synthetic dataset as diverse as the real one.


\subsection{Generation results}
Using the architecture presented in \sectionref{sec:arch}, we trained generative models with both the DDPM and OTFM learning objectives. These models were used to synthesize datasets with the same cardinality and QS distribution as the original training set. \tableref{tab:quantitative results} shows the quantitative results on the synthetic datasets. To provide a reference for these metrics on real data, under the \textit{ref} setting we computed FID, precision, and recall between the training and test sets of the real data, and MS-SSIM on the real training set. Overall, the results indicate that the OTFM model generates higher-quality data, as its FID is lower than that of DDPM. While in terms of realism OTFM outperforms DDPM on all metrics, the results on diversity are more metric-dependent, with DDPM obtaining better results in MS-SSIM and PR. Our hypothesis is that this result is influenced by the presence of anatomically implausible generations, which would improve diversity at the expense of realism. 
The qualitative analysis of the generated volumes (\figureref{fig:qualitative1}) shows that the OTFM model is more robust, leading to spatially consistent and realistic slices. Moreover, it confirms that the data generated with DDPM do not always reflect the Quality Score on which they were conditioned and are sometimes unrealistic and completely out of distribution. This is consistent with the quantitative analysis, as wrongly generated data could increase the diversity of the dataset while lowering its realism. Since both models use the same autoencoder, the differences in the results can be attributed to the adopted denoising process.
The generated meshes depicted in \figureref{fig:mesh_qualitative} show that the main anatomical structures are built correctly, even for smaller details such as the cranial nerve holes visible in the QS~6 mesh or the ear canal visible in the QS~2 and QS~3 meshes. The resolution is high enough to reach a reasonable smoothness of the meshes, even if it is not at the level of the real ones.


\begin{figure}[ht]
  \centering
  %\fbox{\rule{0pt}{2in} \rule{0.9\linewidth}{0pt}}
  \includegraphics[width=0.98\linewidth]{assets/slices_qualitative_captioned.png}

   \caption{Middle slice of the sagittal view of head CT scans. The first row shows real scans, the second shows volumes generated with DDPM, and the third shows volumes generated with OTFM. The generated scans were chosen randomly from the respective datasets. Red borders highlight failed generations, while blue borders mark scans that do not depict the QS on which they were conditioned.}
   \label{fig:qualitative1}
\end{figure}

\begin{table}[ht]
    \centering
    \floatconts
        {tab:quantitative results}%
        {\caption{\textbf{Quantitative evaluation of the generation quality.} Under the \textit{ref} setting, FID, precision, and recall are computed comparing the train and test sets of the real data, while MS-SSIM is calculated on the real training set. 
        % The best results among the generative experiments are highlighted in bold.
        }}
        {\vspace{-10pt}%\small
        \begin{tabular}{lcccccc}
            \toprule
            & FID $\downarrow$ & MS-SSIM $\downarrow$ & IP $\uparrow$ & IR $\uparrow$ & PP $\uparrow$ & PR $\uparrow$ \\ 
            \midrule
            ref & 0.0012 & 0.59 & 0.93 & 0.92 & 0.8 & 0.76 \\
           \midrule
            DDPM & 0.073  & \textbf{0.59} & 0.55 & 0.73 & 0.27 & \textbf{0.76} \\
            
            OTFM & \textbf{0.0024}  & 0.63 & \textbf{0.96} & \textbf{0.82} & \textbf{0.87} & 0.58 \\
            
            \bottomrule
        \end{tabular}}
    
\end{table}
% REMOVED FOR MIDL
% \begin{figure}[ht]
%   \centering
%   \includegraphics[width=0.5\linewidth]{assets/mesh_qualitative.png}
%    \caption{Real and generated meshes. The first row shows real meshes, while the second row shows meshes generated with OTFM in the same position.}
%    \label{fig:mesh_qualitative}
% \end{figure}

\subsection{Downstream tasks}
\label{sec:downstream}
To further evaluate the generation quality, we assess the performance of the synthetic datasets in two clinical downstream tasks: shape completion and skull alignment. These tasks are part of the pre-surgical planning for maxillofacial surgery and are usually performed manually by an expert. The current method of treating malformations of the craniofacial skeleton requires dedicated technicians to manipulate 3D medical images, simulating the correction of the defect targeted by the surgery.
In this context, shape completion consists in automatically reconstructing defective skulls to speed up surgical planning and provide patient-specific hints~\cite{mazzocchetti_neural_2024}.  
To train this model, all data must be aligned to a canonical reference, which is not the case in real clinical data due to differences in acquisition devices and patient positioning. 
For this reason, to use the shape completion tool in a clinical setting, we need another model trained to align skulls. 

Since data availability in the medical domain is often an issue, we decided to conduct the experiments in different scenarios, following the protocol adopted in \citet{wang20243dmeddiffusion3dmedical} to create the training sets:
\textbf{100\%~real} - there is no synthetic data, the model is trained with the original training set ($726$ samples); 
\textbf{100\%~synth} - there is no real data available, the model is trained with all the samples from the synthetic dataset ($726$ samples); 
\textbf{100\%~real + 25\%~synth} - an augmentation setting where we add to the original training set $25\%$ of the synthetic data samples ($726 + 182 = 908$ samples);
\textbf{50\%~real} - there is no synthetic data, the model is trained with half of the samples of the original training set to simulate data scarcity ($363$ samples); 
\textbf{50\%~real + 25\%~synth} - the new training set is made of half of the real training set and $25\%$ of the synthetic data samples to simulate augmentation in a data scarcity setting ($363 + 182 = 545$ samples).
% \textbf{synth} -  no real data available, the models are trained only with synthetic data; \textbf{aug-ds} - there is data scarcity (ds) and we augment 50\% of the real dataset with 25\% of the synthetic one; \textbf{aug} - the synthetic data is still used to augment the real data, but in this case we use 100\% of the real dataset and 25\% of the synthetic one.
% \textbf{synth} -  no real data, $100\%$ synthetic data; \textbf{aug-ds} - data scarcity (ds), $50\%$ of the real data + $25\%$ synthetic data; \textbf{aug} - $100\%$ real data + $25\%$ synthetic data.
% We set a baseline by training the model with all available real data in \textit{synth} and \textit{aug} settings, and with $50\%$ of the real data in \textit{aug-ds}.
With these experiments, we aim to measure whether synthetic data can be used in place of real data and whether they are effective in augmenting existing datasets. Moreover, we designed another experiment to evaluate the possibility of using synthetic data to balance datasets. To do this, we used the model trained with OTFM to augment the real dataset so as to obtain a balanced one, leading to a total cardinality of $1060$ volumes, i.e. $212$ volumes for each Quality Score. For a fair comparison, we also augmented the real dataset in a stratified fashion to obtain the same cardinality as the balanced set, while keeping the QS distribution unbalanced.
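The sample counts above follow from simple arithmetic. As a sketch, assuming that the $25\%$ share of the synthetic dataset is rounded up to the nearest integer and that the five classes are QS~2 to QS~6 (as in the per-class results):
\[
  \lceil 0.25 \times 726 \rceil = \lceil 181.5 \rceil = 182, \qquad 726 + 182 = 908, \qquad 363 + 182 = 545,
\]
\[
  5 \times 212 = 1060, \qquad 1060 - 726 = 334 .
\]
The last value would be the number of synthetic volumes added in both the balanced and the stratified settings, under the assumption that the full real training set of $726$ samples is retained.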


\textbf{Aligner.}
The first task consists in aligning segmented CT scans to a canonical reference. 
% When training the model, all training data must be aligned to the reference to ensure that positional information can be effectively learned. As explained in \sectionref{sec:data}, our dataset is already aligned, and consequently the generated samples are as well. 
The neural network that learns the alignment is built on top of PointNet++~\cite{qi2017pointnet++} and is described in detail in the supplementary material (\sectionref{sec:aligner}).
% REMOVED FOR LACK OF SPACE
%composed of two modules: a first module that regresses a rotation matrix; and a second module that regresses both a refinement rotation matrix and a translation vector. A detailed description of the network architecture can be found in \cref{sec:aligner} of the supplementary material.
The model takes as input skulls represented as point clouds. Since the training set is already aligned, we adopt a self-supervised training strategy by applying random roto-translations to the input point clouds on the fly during training. The aligner is trained to regress the roto-translation matrix that recovers the ground-truth aligned position. The trained model is tested on a set of roto-translated skulls and evaluated with the average per-point distance.
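One possible formalization of this metric is sketched below, under the assumption that point correspondences are preserved by the roto-translation; the notation is ours and is not taken from the supplementary material. Let $\{p_i\}_{i=1}^{N}$ be the aligned ground-truth points, $q_i = R p_i + t$ the randomly roto-translated input points, and $(\hat{R}, \hat{t})$ the roto-translation regressed by the aligner. The average per-point distance can then be written as
\[
  d = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{R} q_i + \hat{t} - p_i \right\|_2 ,
\]
which is zero when $(\hat{R}, \hat{t})$ exactly inverts $(R, t)$, i.e. when $\hat{R} = R^{-1}$ and $\hat{t} = -R^{-1} t$.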

The quantitative results in \tableref{tab:align-otfm-ddpm} show that OTFM-generated data are more effective than DDPM-generated data for this clinical downstream task, as OTFM outperforms DDPM in all settings. 
When augmenting real data (\tableref{tab:align-otfm-real}), OTFM leads to substantial performance gains, especially in the setting that simulates data scarcity.
% As expected, when training using solely synthetic data the performance is slightly worse than when using only real data.
% As expected, in non-augmentation setting using real data is better than using synthetic data.
% Still, with OTFM the gap is more limited. In augmentation settings, the improvement with respect to real data is particularly relevant when the existing dataset is limited 
% (\textit{aug-ds})
% (\textcolor{red}{\textit{50\% real + 25\% synth}}), 
% which is quite common in the medical domain. 
% According to the results reported in \tableref{tab:aligner2}, which show 
Analysing the performance across each Quality Score, we noticed that the model is particularly sensitive to the QS of the skulls, performing poorly with QS~6 skulls, i.e. the least represented Quality Score. 
%That is why we decided to set up another augmentation experiment to analyse the possibility of using generative models to balance datasets. 
%As shown in Figure \Cref{fig:qs_dist}, the distribution of quality scores in the real dataset is highly unbalanced, with very few QS6 skulls. 
%We used OTFM generative model to augment the real dataset and obtain a balanced one, leading to a total cardinality of 1060 images, with 212 images for each quality score. To make a fair comparison, we also augmented the real dataset in a stratified fashion to obtain the same cardinality as the one in the balanced set. 
The results in \tableref{tab:aligner-dataset-balancing} show that balancing the dataset is very effective in solving this issue. The network, receiving many more QS~6 volumes, learns to align them more precisely, leading to better performance in that class. Moreover, the overall score improves, meaning that a balanced dataset helps the model to generalize and is not detrimental to previously over-represented classes.

% \begin{table}[ht]
%     \centering
%     \begin{minipage}[t]{0.35\textwidth}
%         \centering
%         \label{tab:aligner_1}
%         \caption{Aligner evaluation showing the average per point distance in \textit{mm} between roto-translated and ground truth skulls. 
%         %In \textit{synth} and \textit{aug} experiments the REAL baseline shows the results for the model trained with the complete training dataset, while in the \textit{aug-ds} experiment only 50\% of the training data is used.
%         }
%         \begin{tabular}{lccc}
%             \toprule
%             & synth & aug-ds & aug \\ \midrule
%             REAL & \textbf{6.95}            & 30.8                    & 6.95      \\ 
%             DDPM        & 8.11            & 9.73                    & 6.93      \\ 
%             OTFM     & 7.31            & \textbf{9.17}                    & \textbf{6.48}      \\ 
%              \bottomrule
%         \end{tabular}
%     \end{minipage}
%     \hfill
%     \small
%     \begin{minipage}[t]{0.62\textwidth}
%         \centering
%         \label{tab:aligner2}
%         \caption{Aligner dataset balancing experiment. OTFM and REAL rows show results for models trained only with real and synthetic data respectively. OTFM-S is the real dataset augmented in a stratified fashion. OTFM-B is the real dataset augmented to balance the classes cardinality.}
%         \begin{tabular}{lcccccc}
%             \toprule
%             & overall & QS 2 & QS 3 & QS 4 & QS 5 & QS 6 \\
%             \midrule
%             REAL & 6.95 & 6.47 & 4.85 & 6.53 & 7.08 & 21.6 \\ 
%             %OTFM & 7.31 & 6.72 & 6.21 & 6.42 & 6.57 & 21.03 \\ 
%             OTFM-S & 5.87 & \textbf{5.34} & 4.72 & \textbf{5.41} & \textbf{5.22} & 17.87 \\ 
%             OTFM-B & \textbf{5.67} & 5.54 & \textbf{4.60} & 5.71 & 5.28 & \textbf{8.12} \\
%             \bottomrule
%         \end{tabular}
        
        
        
%     \end{minipage}
% \end{table}

% \textcolor{red}{
% \begin{table}[htbp]
% \floatconts
%   {tab:two-tables}%
%   {\caption{\textbf{Aligner evaluation.} It shows the average per point distance in \textit{mm} between roto-translated and ground truth skulls. \textit{only real} refers to settings in which exclusively real data are used, corresponding to $100\%$ of the real training set in the first column and $50\%$ in the third.}}%
%   {%
%     {\vspace{-10pt}%\small % First table
%     % \begin{tabular}{lccc}
%     %         \toprule
%     %         & synth & aug-ds & aug \\ \midrule
%     %         real & \textbf{6.95}            & 30.8                    & 6.95      \\ 
%     %         ddpm        & 8.11            & 9.73                    & 6.93      \\ 
%     %         otfm     & 7.31            & \textbf{9.17}                    & \textbf{6.48}      \\ 
%     %          \bottomrule
%     % \end{tabular}
%     \begin{tabular}{lccc}
%             \toprule
%             only real    & 6.95~          &    -      & 30.8~      \\ 
%             ddpm    & 6.93          & 8.11      & 9.73       \\ 
%             otfm    & \textbf{6.48} & 7.31      & \textbf{9.17}      \\ 
%              \bottomrule
%     \end{tabular}
%     }
%   }
% \end{table}
% }

\begin{table}[htbp]
\floatconts
  {tab:align-otfm-ddpm}%
  {\caption{\textbf{Aligner evaluation: OTFM vs DDPM.} Average per-point distance in \textit{mm} between roto-translated and ground-truth skulls.}}%
  {%
    {\vspace{-10pt}%\small % First table
    % \begin{tabular}{lccc}
    %         \toprule
    %         & synth & aug-ds & aug \\ \midrule
    %         real & \textbf{6.95}            & 30.8                    & 6.95      \\ 
    %         ddpm        & 8.11            & 9.73                    & 6.93      \\ 
    %         otfm     & 7.31            & \textbf{9.17}                    & \textbf{6.48}      \\ 
    %          \bottomrule
    % \end{tabular}
    \begin{tabular}{lccc}
            \toprule
                    & 100\% synth    &   \makecell{100\% real + 25\% synth}    &  \makecell{50\% real + 25\% synth} \\ \midrule
            DDPM    &  8.11      & 6.93          &  9.73       \\ 
            OTFM    & \textbf{7.31}   & \textbf{6.48} &  \textbf{9.17}      \\ 
             \bottomrule
    \end{tabular}
    }
  }
\end{table}



\begin{table}[htbp]
\floatconts
  {tab:align-otfm-real}%
  {\caption{\textbf{Aligner evaluation: OTFM vs real data.} Average per-point distance in \textit{mm} between roto-translated and ground-truth skulls. \textit{real only} denotes that only real data were used to train the skull alignment model.}}%
  {%
    {\vspace{-10pt}%\small % First table
    % \begin{tabular}{lccc}
    %         \toprule
    %         & synth & aug-ds & aug \\ \midrule
    %         real & \textbf{6.95}            & 30.8                    & 6.95      \\ 
    %         ddpm        & 8.11            & 9.73                    & 6.93      \\ 
    %         otfm     & 7.31            & \textbf{9.17}                    & \textbf{6.48}      \\ 
    %          \bottomrule
    % \end{tabular}
    \begin{tabular}{lcc}
            \toprule
                    & 100\% real       & 50\% real \\ \midrule
            real only    & 6.95          & 30.8      \\  
            $+25\%$ OTFM    & \textbf{6.48}   & \textbf{9.17}      \\ 
             \bottomrule
    \end{tabular}
    }
  }
\end{table}





\begin{table}[htbp]
\floatconts
  {tab:aligner-dataset-balancing}%
  {\caption{\textbf{Aligner dataset balancing experiment.} Average per-point distance in \textit{mm} between roto-translated and ground-truth skulls. \textit{OTFM-s} is the real dataset augmented in a stratified fashion. \textit{OTFM-b} is the real dataset augmented to balance the class cardinalities.}}%
  {%
    {\vspace{-10pt}%\small % First table
    % \begin{tabular}{lccc}
    %         \toprule
    %         & synth & aug-ds & aug \\ \midrule
    %         real & \textbf{6.95}            & 30.8                    & 6.95      \\ 
    %         ddpm        & 8.11            & 9.73                    & 6.93      \\ 
    %         otfm     & 7.31            & \textbf{9.17}                    & \textbf{6.48}      \\ 
    %          \bottomrule
    % \end{tabular}
    % \textcolor{red}{
    % \begin{tabular}{lccc}
    %         \toprule
    %                 & \makecell{\colorbox{lightblue}{~~~~~~~~~~100\% real~~~~~~~~~~}\\\colorbox{lightgray}{(100\% real + 25\% synth)}}    & 100\% synth        & \makecell{\colorbox{lightblue}{~~~~~~~~~~50\% real~~~~~~~~~~}\\\colorbox{lightgray}{(50\% real + 25\% synth)}} \\ \midrule
    %         real    & \cellcolor{lightblue}6.95          &    -      & \cellcolor{lightblue}30.8      \\ 
    %         ddpm    & \cellcolor{lightgray}6.93          & 8.11      & \cellcolor{lightgray}9.73       \\ 
    %         otfm    & \cellcolor{lightgray}\textbf{6.48} & 7.31      & \cellcolor{lightgray}\textbf{9.17}      \\ 
    %          \bottomrule
    % \end{tabular}
    % \begin{tabular}{lccc}
    %         \toprule
    %                 & \makecell{100\% real\\\colorbox{lightgray}{~~(+ 25\% synth)~~}}    & 100\% synth        & \makecell{50\% real\\\colorbox{lightgray}{~~(+ 25\% synth)~~}} \\ \midrule
    %         real    & 6.95          &    -      & 30.8      \\ 
    %         ddpm    & \cellcolor{lightgray}6.93          & 8.11      & \cellcolor{lightgray}9.73       \\ 
    %         otfm    & \cellcolor{lightgray}\textbf{6.48} & 7.31      & \cellcolor{lightgray}\textbf{9.17}      \\ 
    %          \bottomrule
    % \end{tabular}
    % }
    % \textcolor{red}{
    % \begin{tabular}{lccc}
    %         \toprule
    %                 & \makecell{100\% real\\(+25\% synth)*}    & 100\% synth        & \makecell{50\% real\\(+25\% synth)*} \\ \midrule
    %         real    & 6.95~          &    -      & 30.8~      \\ 
    %         ddpm    & 6.93*          & 8.11      & 9.73*       \\ 
    %         otfm    & \textbf{6.48}* & 7.31      & \textbf{9.17}*      \\ 
    %          \bottomrule
    % \end{tabular}
    % }
    % \hfill
    % Second table
    \begin{tabular}{lcccccc}
            \toprule
            & overall & QS 2 & QS 3 & QS 4 & QS 5 & QS 6 \\
            \midrule
            100\% real & 6.95 & 6.47 & 4.85 & 6.53 & 7.08 & 21.6 \\ 
            %OTFM & 7.31 & 6.72 & 6.21 & 6.42 & 6.57 & 21.03 \\ 
            OTFM-s & 5.87 & \textbf{5.34} & 4.72 & \textbf{5.41} & \textbf{5.22} & 17.87 \\ 
            OTFM-b & \textbf{5.67} & 5.54 & \textbf{4.60} & 5.71 & 5.28 & \textbf{8.12} \\
            \bottomrule
    \end{tabular}}
  }
\end{table}


\textbf{Shape completion.}
To automatically reconstruct malformed skulls, we followed the work of \citet{wang2022pointattnneedattentionpoint} and used PointAttN, an attention-based point cloud completion network. To generate the defects, we replicated the pipeline of \citet{mazzocchetti_neural_2024}, which removes from the input point cloud cuboids sized between $3$\,cm and $10$\,cm. The partial point clouds used for training are generated online, while the validation and test clouds are kept fixed throughout all the experiments. To evaluate the model, we used Chamfer Distance (CD), Accuracy, and Completeness, all computed considering only the defective region. 
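For clarity, a common formulation of these metrics is sketched below. We assume that Accuracy measures the distance from the predicted points to the ground truth, that Completeness measures the reverse direction, and that CD is their average; the notation is ours. Given the predicted points $P$ and the ground-truth points $G$ of the defective region,
\[
  \mathrm{Acc} = \frac{1}{|P|} \sum_{p \in P} \min_{g \in G} \| p - g \|_2, \qquad
  \mathrm{Comp} = \frac{1}{|G|} \sum_{g \in G} \min_{p \in P} \| g - p \|_2, \qquad
  \mathrm{CD} = \frac{\mathrm{Acc} + \mathrm{Comp}}{2} .
\]
The averaging assumption is consistent with the reported values: for instance, for DDPM in the 100\% synth setting, $(4.34 + 2.66)/2 = 3.50$.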

The results reported in \tableref{tab:shape-completion-otfm-ddpm} and \tableref{tab:shape-completion-otfm-real} highlight that the models trained with OTFM-generated data outperform the ones trained with DDPM-generated data, and that augmenting the real dataset improves performance.
% capability of the synthetic data to substitute and augment real ones. 
% \textcolor{red}{When the model is trained using solely OTFM synthetic data, its performance is very close to the results obtained with real data.}
% Even in non-augmentation setting, the performance of the models trained using synthetic data generated with OTFM is very close to the results obtained with real data.
Since we found experimentally that the performance of the model does not vary greatly across QSs, balancing the dataset is less critical in this task. Therefore, the results of the balancing experiment are reported in the supplementary material (\sectionref{sec:shape_completion}).

\begin{table}[ht]
\centering
\floatconts
    {tab:shape-completion-otfm-ddpm}%
    {\caption{\textbf{Shape completion evaluation: OTFM vs DDPM.} For each experimental setting we show Chamfer Distance (CD), Accuracy and Completeness in millimeters.} 
    }
    {\vspace{-10pt}%\small
    \resizebox{\textwidth}{!}{
    \begin{tabular}{lccccccccc}
            \hline
            & \multicolumn{3}{c}{100\% synth} & \multicolumn{3}{c}{100\% real + 25\% synth} & \multicolumn{3}{c}{50\% real + 25\% synth} \\ \hline
            & CD $\downarrow$      & Acc $\downarrow$    & Comp $\downarrow$   & CD $\downarrow$      & Acc $\downarrow$     & Comp $\downarrow$  & CD $\downarrow$    & Acc $\downarrow$   & Comp $\downarrow$ \\ \hline
            DDPM    & 3.50    & 4.34   & 2.66    & 3.30   & 4.04   & 2.56 & 3.37    & 4.15    & \textbf{2.60} \\
            OTFM & \textbf{3.36}    & \textbf{4.09}   & \textbf{2.63} & \textbf{3.17}   & \textbf{3.81}   & \textbf{2.54} & \textbf{3.31}    & \textbf{3.99}    & 2.63  \\ \hline
    \end{tabular}}
    }
            
\end{table}

\begin{table}[ht]
\centering
\floatconts
    {tab:shape-completion-otfm-real}%
    {\caption{\textbf{Shape completion evaluation: OTFM vs real data.} For each experimental setting we show Chamfer Distance (CD), Accuracy and Completeness in millimeters. \textit{real only} denotes that only real data have been used to train the shape completion model.}
    }
    
    {\vspace{-10pt}%\small
    \begin{tabular}{lcccccc}
            \hline
            & \multicolumn{3}{c}{100\% real} &  \multicolumn{3}{c}{50\% real} \\ \hline
            & CD $\downarrow$      & Acc $\downarrow$    & Comp $\downarrow$   &  CD $\downarrow$    & Acc $\downarrow$   & Comp $\downarrow$ \\ \hline
            real only & 3.29    & 3.98   & 2.6  &  3.41    & 4.12    & 2.69  \\
            $+25\%$ OTFM & \textbf{3.17}   & \textbf{3.81}   & \textbf{2.54} & \textbf{3.31}    & \textbf{3.99}    & \textbf{2.63}  \\ \hline
    \end{tabular}
    }
            
\end{table}


% \begin{table}[ht]
% \centering
% \floatconts
%     {tab:shape-completion}%
%     {\caption{\textbf{Shape completion evaluation.} For each experimental setting we show Chamfer Distance (CD), Accuracy and Completeness in millimeters. 
%     %In \textit{synth} and \textit{aug} experiments the REAL baseline shows the results for the model trained with the complete training dataset, while in the \textit{aug-ds} experiment only 50\% of the training data is used. 
%     }}
    
%     {\vspace{-10pt}%\small
%     % \begin{tabular}{lccccccccc}
%     %         \hline
%     %         & \multicolumn{3}{c}{synth} & \multicolumn{3}{c}{aug-ds} & \multicolumn{3}{c}{aug} \\ \hline
%     %         & CD      & Acc    & Comp   & CD      & Acc     & Comp   & CD     & Acc    & Comp  \\ \hline
%     %         real & \textbf{3.29}    & \textbf{3.98}   & \textbf{2.6}   & 3.41    & 4.12    & 2.69   & 3.29   & 3.98   & 2.6  \\
%     %         ddpm & 3.50    & 4.34   & 2.66   & 3.37    & 4.15    & \textbf{2.60}   & 3.30   & 4.04   & 2.56  \\
%     %         otfm & 3.36    & 4.09   & 2.63   & \textbf{3.31}    & \textbf{3.99}    & 2.63   & \textbf{3.17}   & \textbf{3.81}   & \textbf{2.54}  \\ \hline
%     % \end{tabular}
%     \textcolor{red}{
%     \begin{tabular}{lccccccccc}
%             \hline
%             & \multicolumn{3}{c}{\makecell{100\% real\\(+25\% synth)*}} & \multicolumn{3}{c}{100\% synth} & \multicolumn{3}{c}{\makecell{50\% real\\(+25\% synth)*}} \\ \hline
%             & CD ↓      & Acc ↓    & Comp ↓   & CD ↓      & Acc ↓     & Comp ↓  & CD ↓    & Acc ↓   & Comp ↓ \\ \hline
%             real & 3.29    & 3.98   & 2.6  & -   & -  & - & 3.41    & 4.12    & 2.69  \\
%             ddpm    & 3.30*   & 4.04*   & 2.56* & 3.50    & 4.34   & 2.66    & 3.37*    & 4.15*    & \textbf{2.60}* \\
%             otfm & \textbf{3.17}*   & \textbf{3.81}*   & \textbf{2.54}* & 3.36    & 4.09   & 2.63 & \textbf{3.31}*    & \textbf{3.99}*    & 2.63*  \\ \hline
%     \end{tabular}}
%     }
            
% \end{table}



% REMOVED FOR MIDL
\subsection{Failure cases}
There are a few cases in which OTFM fails to generate accurate data. By qualitatively analysing these cases, we can identify two main sources of defects: those connected to the nature and quality of the training data, and those related to the generation process. An example mesh with the first type of error, showing anatomically implausible holes in the skeletal structure, is depicted on the left side of \figureref{fig:mesh_failure}. As explained in \sectionref{sec:data}, the thresholding applied to obtain the skeletal component of a CT scan can lead to holes in the resulting skull. This means that the training set contains examples with non-anatomical holes as well. As a result, the model learns to imitate this feature, even though it is unwanted.
As regards errors emerging from the generative process, a second example, reported on the right side of \figureref{fig:mesh_failure}, shows defects connected to the generation of high-frequency details.
In more detail, the generated QS~5 mesh contains dental arches that are not sufficiently sharp, resulting in smoother volumes with less anatomical precision. 

% REMOVED FOR MIDL
% \begin{figure}[ht]
%     \centering
%     \includegraphics[width=0.4\linewidth]{assets/failures.png}
%     \caption{Generated meshes with some failures: on the left, a QS 2 mesh with anatomically inaccurate holes; on the right a QS 5 mesh that shows smooth dental arches.}
    
% \end{figure}

\begin{figure}[ht]
\floatconts
  {fig:qualitative_3d}%
  {\caption{(a): Real and generated meshes. 
  % The first row shows real meshes, while the second row shows meshes generated with OTFM in the same position. 
  (b): Generated meshes with some failures.
  % : on the left, a QS 2 mesh with anatomically inaccurate holes; on the right a QS 5 mesh that shows smooth dental arches.
  }}
  {
    \subfigure{
      \label{fig:mesh_qualitative}%
      \includegraphics[width=0.45\textwidth]{assets/mesh_qualitative.png}%
    }\hfill
    \subfigure{%
      \label{fig:mesh_failure}%
      \includegraphics[width=0.48\textwidth]{assets/failures.png}%
    }
  }
\end{figure}

%While OT-FM obtains generally better generations with respect to DDPM, the qualitative analysis shows that this is not always the case. Some unwanted features that are present in the training set, \eg an excessive amount of holes in the bone structure, tend to appear more often in images generated with FM. The model trained with DDPM learning objective tends to generate more defective images, but Figure \cref{fig:rob_vs_qual} shows that sometimes it reaches peaks of quality that are not reached by the OT-FM counterpart. The experiments do not directly explain this behaviour, but we hypothesize that it is related to the trajectories taken in the denoising process.

% ADD MESH IMAGE