\section{Analysis and discussion}

\paragraph{Reconstruction quality.}

Our results show that gaze-based mental image reconstruction without prior knowledge about the target is possible and outperforms the previous best manual reconstruction method by $\sim$15\% in MAFD (see Table~\ref{tbl:results}).
An additional user study supports this finding: users tend to rate our reconstructions as more similar to the target image than the reconstructions produced by the method of \citet{zaltron2020cg}.
However, the user feedback and a qualitative comparison with the FaceMaker~\citep{schwind2017facemaker} control condition indicate that there is still room for improvement.
As faces are perceived holistically~\citep{frowd2004evofit}, a few mismatched features might make a face unrecognisable or, in our case, appear vastly different from the target image.
Furthermore, we observed a large variance in the predictions of our system, with up to 16 in MAFD for nose-related features.
This variance can be partly explained by humans' differing ability to memorise images, and in particular faces~\citep{verhallen_general_2017}, as the variance for the FaceMaker control condition is similarly high.
This is likely also the reason for the relatively high MAFD of the FaceMaker control method, since participants recreated their mental image of the target, not the target face directly.
Finally, we evaluated the SUS questionnaires for each condition. The resulting SUS scores are 52 for the method by \citet{zaltron2020cg}, 67 for our proposed system, and 76 for FaceMaker (see Appendix~\ref{sec:usability} for details). Overall, these results point towards better usability of GBC-MIR than of the method by \citet{zaltron2020cg}.
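To help interpret these numbers, and assuming the standard SUS scoring scheme was applied (an assumption on our part, as the scoring is not restated in this section), a participant's score is obtained from the ten item responses $s_i \in \{1,\dots,5\}$ as
\[
  \mathrm{SUS} = 2.5 \Bigl( \sum_{i \in \{1,3,5,7,9\}} (s_i - 1) \; + \sum_{i \in \{2,4,6,8,10\}} (5 - s_i) \Bigr),
\]
so that scores range from 0 to 100 and higher values indicate better perceived usability.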

\begin{figure}[t]
    \centering
    \includegraphics[width=0.6\columnwidth]{figures/MASD_plot.pdf}
    \caption{Simulated mean absolute feature distance (MAFD) over number of iterations for different probability distributions of the feature extractor. $P_\text{GFE$_0$}$ are the probabilities estimated for our gaze-guided feature extractors while $P_\text{GFE$_1$}$ and $P_\text{GFE$_2$}$ are better, theoretical versions. $P_\text{equal}$ represents a random feature extractor while $P_\text{ideal}$ always selects the best feature value possible.}
    \label{fig:MAFD_plot}
\end{figure}

\paragraph{Feature extractor importance.}
Since we simulate the GFE during training of GBC-MIR by replacing it with a selection layer, we are able to analyse the influence of different probability distributions $P_\text{GFE}$ on the performance.
Figure~\ref{fig:MAFD_plot} shows the simulated MAFD over the number of iterations for different probability distributions $P$ of the feature extractor.
At the extremes, $P_\text{ideal}$ assumes a feature extractor always selecting the best possible feature, while $P_\text{equal}$ has an equal chance of selecting any feature.
The MAFD for $P_\text{equal}$ remains constant at 52.56, regardless of the number of iterations.
Because a random feature extractor conveys no information about the target, our network learns to reconstruct the mean face in this case.
$P_\text{GFE$_0$}$ shows the simulated MAFD for the probabilities we estimated for the pre-trained GFE used in our system.
To analyse how GBC-MIR scales with the quality of the GFE, we simulated results with improved versions $P_\text{GFE$_{1,2}$}$, as shown in Figure \ref{fig:MAFD_plot}.
For $P_\text{GFE$_{1}$}$, we subtracted 10\% from the probabilities of selecting sub-optimal features (ranks 2--6) and added it to the probability of selecting the best feature (rank 1). In the case of $P_\text{GFE$_2$}$, 20\% was redistributed instead.
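One possible formalisation of this redistribution, given here only as a sketch, reads the percentage as relative to each sub-optimal probability; this is an assumption, and an absolute reading (subtracting a fixed amount from each rank) would lead to a different expression. Writing $p_r$ for the probability that $P_\text{GFE$_0$}$ assigns to the feature of rank $r$ and $\alpha$ for the redistributed fraction ($\alpha = 0.1$ for $P_\text{GFE$_1$}$, $\alpha = 0.2$ for $P_\text{GFE$_2$}$), with the symbols $p_r$, $p'_r$ and $\alpha$ introduced purely for illustration, the improved distribution would be
\[
  p'_1 = p_1 + \alpha \sum_{r=2}^{6} p_r,
  \qquad
  p'_r = (1 - \alpha)\, p_r \quad \text{for } r = 2, \dots, 6,
\]
which keeps $\sum_{r=1}^{6} p'_r = 1$. As a purely hypothetical example (these values are not our estimated probabilities), $p = (0.40, 0.20, 0.15, 0.10, 0.10, 0.05)$ and $\alpha = 0.1$ give $p' = (0.46, 0.18, 0.135, 0.09, 0.09, 0.045)$.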
As can be observed in Figure~\ref{fig:MAFD_plot}, the MAFD of GBC-MIR scales well with the performance of the feature extractor. The MAFD of $P_\text{GFE$_1$}$ is consistently lower than that of $P_\text{GFE$_0$}$ across iterations, and the same holds for $P_\text{GFE$_2$}$ relative to $P_\text{GFE$_1$}$.
We found that the query engine learns different strategies depending on the GFE performance. For unreliable GFEs, it repeatedly samples six equidistant areas in the feature space to determine the most likely value range. More reliable GFEs allow the query engine to explore the feature space more thoroughly, as fewer repeated measurements are required to increase confidence (see Appendix~\ref{sec:qe_analysis} for details).