\appendix

\section{Evaluating the gaze-guided feature extractor (GFE)}
\label{gfe_eval}

We fine-tuned and evaluated the GFE proposed by \citet{strohm21_iccv} for mental human-like face reconstruction from FaceMaker~\citep{schwind2017facemaker}.
In their work they utilised knowledge about the target face to generate sets of six auxiliary faces, ensuring that the target facial features were reflected in the faces shown to the user.
In stark contrast, our work does not use prior knowledge, so the target features are not necessarily reflected in the shown faces.
As this might influence the gaze behaviour of users, we collected additional training data to fine-tune their GFE.

\subsection{Data Collection}
\label{sec:data_collection}
We recorded gaze data of 7 female and 3 male participants aged between 19 and 35 years (M=27.1, SD=4.7).
They were recruited through university mailing lists and compensated for their participation with 15€ per hour.
The eyesight of all participants was normal or corrected-to-normal.
We recorded their binocular gaze at 2,000Hz using a stationary EyeLink 1000 Plus eye tracker.
To increase gaze tracking accuracy, we used a chin rest to stabilise participants' heads.
A 24.4-inch screen with a resolution of $1920\times1080$ pixels was placed 90cm in front of the participants to show the face stimuli.
Each stimulus covered $8.9^{\circ}$ of visual angle.
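As a rough check of the stimulus size, and assuming a flat screen viewed head-on, the on-screen extent $s$ of a stimulus that subtends a visual angle $\theta$ at viewing distance $d$ is
\[
  s = 2d\tan\left(\frac{\theta}{2}\right) = 2 \cdot 90\,\text{cm} \cdot \tan\left(4.45^{\circ}\right) \approx 14.0\,\text{cm}.
\]
The symbols $s$, $d$ and $\theta$ are introduced here for illustration only, and the resulting size is derived from the reported angle and distance rather than measured.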

Once participants agreed to our consent form they completed three trials.
Each trial started with a calibration-validation procedure to ensure accurate eye tracking.
Following this, a target face was shown for 30 seconds that participants had to memorise.
After the memorisation phase, we iteratively displayed ten sets of six auxiliary faces for 30 seconds each and asked participants to rank the faces from one to six according to their similarity to their mental image.
All face stimuli were generated randomly and independently of each other by sampling the FaceMaker features from a uniform distribution.

\subsection{Fine-tuning and evaluation}
We fine-tuned the GFE using 28 out of 30 trials we collected for each participant.
\citet{strohm21_iccv} were able to train their GFE to perform binary classification, since a feature either was the target feature or not.
However, since our work does not use prior knowledge, we cannot create binary labels. Instead, we fine-tune their GFE using continuous labels by minimising the mean squared error.\footnote{The architecture and all hyper-parameters are identical to those of \citet{strohm21_iccv}.} The labels are created by calculating the similarity $1-|f_M-f_i|$ between each feature $f_i \in F_i$ and the corresponding target feature $f_M \in F_M$.
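For illustration, and assuming that all FaceMaker feature values are normalised to $[0,1]$ (an assumption that the similarity measure suggests but that is not stated explicitly), the label $y_i$ and the regression objective can be sketched as
\[
  y_i = 1 - |f_M - f_i|, \qquad
  \mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2,
\]
where $\hat{y}_i$ denotes the GFE prediction and $N$ the number of training samples; $y_i$, $\hat{y}_i$, $\mathcal{L}$ and $N$ are notation introduced here.
For example, a target value $f_M = 0.7$ and a shown value $f_i = 0.4$ give $y_i = 1 - 0.3 = 0.7$, while an identical value gives $y_i = 1$.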

Using the remaining two trials of each participant we estimated $P_\text{GFE}$ by counting how often the model selects the best feature, the second best and so on.
Table \ref{tbl:estimating_P} shows the estimated probability distribution $P_\text{GFE}$ for the original model~\citep{strohm21_iccv} and our model fine-tuned with additional training data and continuous labels.
We observe that fine-tuning with additional task-specific training data improved the GFE: the probability of extracting a top-3 feature increased from 61.1\% to 68.2\%.
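The counting estimate can be written out as follows, where $n_k$ (notation introduced here) is the number of test cases in which the model selects the $k$-th best feature:
\[
  P_\text{GFE}(k) = \frac{n_k}{\sum_{j=1}^{6} n_j}, \qquad k = 1,\dots,6.
\]
The top-3 probabilities are the sums of the first three entries of each row in Table~\ref{tbl:estimating_P}, i.e., $18.2\% + 20.2\% + 22.7\% = 61.1\%$ for the original model and $22.9\% + 21.8\% + 23.5\% = 68.2\%$ for the fine-tuned model.
As a derived summary that is computed from the table rather than measured separately, the expected rank $\sum_{k} k\,P_\text{GFE}(k)$ drops from approximately $3.13$ to approximately $2.84$.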

\begin{table}[t]
\caption{Comparison of $P_\text{GFE}$ (i.e., the probability distribution for selecting the $k$-th best feature value) for the pre-trained model by \citet{strohm21_iccv} and our fine-tuned model. With our model, the probability of extracting a top-3 feature increases from 61.1\% to 68.2\%.}
\begin{center}
\begin{tabular}{lcccccc}
\toprule
 &  \multicolumn{6}{c}{Similarity Rank}\\
\cmidrule(r){2-7}
Model & 1 & 2 & 3 & 4 & 5 & 6\\
\midrule
\citet{strohm21_iccv} & 18.2\% & 20.2\% & 22.7\% & 16.9\% & 12.9\% & 9.1\% \\
Ours          & 22.9\% & 21.8\% & 23.5\% & 16.7\% & 10.0\% & 5.1\% \\
\bottomrule
\end{tabular}
\label{tbl:estimating_P}
\end{center}
\end{table}

\begin{figure}[t]
\floatconts
  {fig:query_engine_distribution}
  {\caption{The plots show the probability of query engines to propose a certain auxiliary feature value. (a) shows the feature value distribution for the query engine trained using our estimated values $P_\text{GFE$_0$}$, while (b) shows the feature value distribution for a better estimate $P_\text{GFE$_2$}$.}}
  {%
    \subfigure[]{\label{fig:sub1}%
      \includegraphics[width=0.47\linewidth]{figures/query_engine_distribution.pdf}}
    \qquad
    \subfigure[]{\label{fig:sub2}%
      \includegraphics[width=0.47\linewidth]{figures/query_engine_distribution_high_probs.pdf}}
  }
\end{figure}

\section{Query Engine Performance}
\label{sec:qe_analysis}
To gain a better understanding of how GBC-MIR works, we analyse the behaviour of our query engine.
Figure \ref{fig:query_engine_distribution} shows the predicted feature value distribution for two ten-iteration query engines trained using $P_\text{GFE$_0$}$ (Figure \ref{fig:sub1}) and $P_\text{GFE$_2$}$ (Figure \ref{fig:sub2}).
We observe that our query engine in Figure \ref{fig:sub1} learned to split the feature space into six roughly equidistant areas, due to the six faces shown.
When our query engine is trained with $P_\text{GFE$_0$}$, it learns the strategy to take multiple samples from the same area in order to increase its confidence in a feature's rank, rather than searching the entire feature space for the exact values of the mental image. This can be seen in the higher probability peaks and the surrounding gaps of zero probability.
Figure \ref{fig:sub2} shows the predicted feature value distribution for our query engine trained with the better simulated GFE $P_\text{GFE$_2$}$.
We observe a more uniform distribution of the predictions across the feature space, as there are no longer any surrounding gaps of zero probabilities.
With higher confidence in the selected feature's rank, the query engine does not require redundant samples and can use a more fine-grained search strategy.

\section{Usability and Workload}
\label{sec:usability}
\paragraph{System Usability Scale}
We asked participants to fill in a System Usability Scale (SUS) questionnaire~\citep{brooke1996sus} to assess the usability of our three conditions.
The SUS yields a single score on a scale from 0 to 100 (higher is better), computed from participants' answers to ten standardised questions on a five-point Likert scale.
The resulting SUS scores are 52 for the method by \citet{zaltron2020cg}, 67 for our proposed system and 76 for FaceMaker.
A repeated measures one-way ANOVA test~\citep{gueorguieva2004move} 
indicated a significant difference (p=0.007) between the SUS scores of the three conditions.
Pairwise Tukey's HSD post-hoc tests~\citep{abdi2010tukey} 
indicate a significant difference only between the method by \citet{zaltron2020cg} and FaceMaker (p=0.026).
Overall, these results tend towards better usability of GBC-MIR compared to the method by Zaltron~et~al.
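For reference, the standard SUS scoring rule~\citep{brooke1996sus} maps the ten item responses $x_1,\dots,x_{10} \in \{1,\dots,5\}$ to a single score via
\[
  \text{SUS} = 2.5\left(\sum_{i\ \text{odd}} (x_i - 1) + \sum_{i\ \text{even}} (5 - x_i)\right).
\]
For example, answering 4 on every odd (positively worded) item and 2 on every even (negatively worded) item yields $2.5\,(5 \cdot 3 + 5 \cdot 3) = 75$.
The symbols $x_i$ are notation introduced here, and this sketch assumes that the unmodified scoring rule was applied.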

\paragraph{NASA-TLX Workload}
In addition to the SUS, we asked participants to complete the Raw-NASA-TLX questionnaire~\citep{hart2006nasa}, allowing us to assess the perceived workload of the participants for each condition on a scale from 1 to 100 (lower is better).
The resulting values are 46 for the method by \citet{zaltron2020cg}, 47 for our proposed system and 30 for FaceMaker.
A repeated measures one-way ANOVA test~\citep{gueorguieva2004move} 
indicated a significant difference (p=0.002) between the TLX values of the three conditions.
Pairwise Tukey's HSD post-hoc tests~\citep{abdi2010tukey} 
indicate a significant difference between the method by Zaltron~et~al. and FaceMaker (p=0.022) as well as between our method and FaceMaker (p=0.015).
Overall, these results suggest that the usage of FaceMaker was the least demanding task, while there is no significant difference between GBC-MIR and the method of Zaltron~et~al.