\section{Reconstruction experiments}

\definecolor{myGreen}{HTML}{006600}
\definecolor{myOrange}{HTML}{ff8000}
\definecolor{myRed}{HTML}{FF0000}

For evaluation, we used the pre-trained GFE and decoder provided by \citet{strohm21_iccv} and therefore evaluated GBC-MIR in the FaceMaker face image domain~\citep{schwind2017facemaker}.
FaceMaker is an interactive tool for manually generating human-like faces by manipulating a set of $28$ face appearance sliders, giving fine-grained control over different facial features.
We compare GBC-MIR with a baseline method and a FaceMaker control.

\subsection{Methods}
\paragraph{Ours -- Gaze-based collaborative mental image reconstruction (GBC-MIR).}
To train our system, we estimated the values $P_\text{GFE}$ of the pre-trained GFE model for the selection layer (see appendix).
As we have to reconstruct $28$ features and the GFE requires six input images, the dense layers of our system depicted in Figure~\ref{fig:neural_implementation} consist of $28 \times 6 = 168$ neurons with sigmoid activation.
The projection layers consist of $50$ neurons with ReLU activation, projecting the $168$ image features into a $50$-dimensional space.
The recurrent layers are implemented as gated recurrent unit (GRU)~\citep{cho2014learning} layers, each extracting $50$ features from the history of selected and projected features, respectively.
We set the number of iterations to $m=10$, resulting in ten stacked iteration modules.
The output dense layer consists of $28$ neurons with sigmoid activation.
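As a compact summary of the layer widths stated above (the symbols $d_\text{dense}$, $d_\text{proj}$, $d_\text{GRU}$ and $d_\text{out}$ are shorthand introduced only for this sketch):
\[
d_\text{dense} = 28 \times 6 = 168, \qquad d_\text{proj} = 50, \qquad d_\text{GRU} = 50, \qquad d_\text{out} = 28 .
\]
Because the dense and output layers use sigmoid activations, their outputs lie in $(0,1)$, so the $168$ values of a dense layer can be read as six candidate feature vectors of $28$ values each.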

The model was trained for 50 epochs on a V100 GPU with a batch size of 32, where each batch consists of randomly generated target feature vectors of size $28$.
We used the Adam optimiser~\citep{kingma2014adam} with a learning rate of 0.0004 and default parameters otherwise.
For the user study, we replaced the selection layer with the pre-trained GFE.
%The $28*6$ predicted values of each dense layer are decoded into pixel space allowing us to show the six images to the users and to input them into the GFE with the collected gaze data.
%We use the pre-trained decoder that takes the predicted image features as an input and generates the corresponding image.

As in \citet{strohm21_iccv}, the participants' task in each iteration was to rank the six auxiliary images by similarity to their mental image.
They had 30 seconds to complete this task, after which the GFE extracted features from the recorded gaze data.
Since the GFE requires 30 seconds of gaze recordings to extract image features and we used ten iterations, the total reconstruction time amounted to five minutes.
This keeps the interaction with the system short, which reduces possible eye strain.
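As a quick check of this time budget, with $m = 10$ iterations of $30$ seconds each,
\[
T_\text{total} = m \times 30\,\text{s} = 10 \times 30\,\text{s} = 300\,\text{s} = 5\,\text{min},
\]
where $T_\text{total}$ is a symbol introduced only for this calculation and counts only the gaze recording time per iteration.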

\paragraph{Baseline -- Interactive evolution.}
As the baseline, we used an adapted version of the state-of-the-art mental image reconstruction method proposed by \citet{zaltron2020cg}.
This system initially proposes nine randomly generated faces.
Users then have the option to either generate new random faces or mutate faces:
Any number of the proposed faces can be locked so that they are not replaced by new random faces.
Similarly, any number of faces can be selected for mutation.
To mutate, the system averages the selected faces and generates new faces by adding random noise to this average face.
Participants can control the amount of noise, i.e., how large the changes should be.
Additionally, participants can choose whether a mutation changes all features or only a single random feature.
This system can be used to iteratively traverse the latent space until participants decide that one of the proposed faces adequately resembles their mental image.
The original method by Zaltron~et~al. also allowed explicit manipulation of certain face attributes. Because this requires a dataset with labelled attributes, we could not include this functionality.
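The mutation step can be sketched as follows. The notation is ours and purely illustrative: $S$ is the set of selected faces, $z_j$ the feature vector of face $j$, $\sigma \geq 0$ the participant-controlled noise amount, and $\epsilon$ a random noise vector whose distribution we leave unspecified.
\[
\bar{z} = \frac{1}{|S|} \sum_{j \in S} z_j, \qquad z' = \bar{z} + \sigma\, \epsilon .
\]
In the single-feature mode, under the same assumptions, only one randomly chosen component of $\epsilon$ is non-zero, so $z'$ differs from $\bar{z}$ in at most one feature.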

\paragraph{Control -- FaceMaker.}
FaceMaker is a tool for manually generating face images using a set of sliders.
More specifically, starting from a mean face, participants can manipulate 28 facial features by moving the associated sliders.
FaceMaker updates the face in real time.
Users can adjust the sliders until they are satisfied that the created face resembles their mental image sufficiently well.
Since \citet{strohm21_iccv} used FaceMaker to train the GFE and feature decoder, and since it allows full control over the generation of the faces, we use it as a control condition.

\subsection{User study}
\label{sec:eval_study}

We compared these three methods in a user study with 12 participants (four female) aged between 18 and 29 years (M=23.5, SD=3.6).
All participants had normal or corrected-to-normal vision and were recruited through university mailing lists.
For our GBC-MIR method, binocular gaze was recorded at 2,000 Hz using a stationary EyeLink 1000 Plus eye tracker.
To increase gaze tracking accuracy, we used a chin rest to stabilise participants' heads.
Each stimulus covered about $8.9^{\circ}$ of visual angle.
We counterbalanced the order of conditions using a Latin square in a within-subjects study design.
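As an illustration of such a counterbalancing scheme, one possible $3 \times 3$ Latin square over the three conditions is shown below, with GBC-MIR (G), the baseline (B) and FaceMaker (F). The concrete orders and the assignment of participants to them are assumptions made for this example; with $12$ participants, a balanced assignment would place $12 / 3 = 4$ participants in each row.
\[
\begin{array}{c|ccc}
\text{Order} & \text{1st} & \text{2nd} & \text{3rd} \\ \hline
1 & \text{G} & \text{B} & \text{F} \\
2 & \text{B} & \text{F} & \text{G} \\
3 & \text{F} & \text{G} & \text{B}
\end{array}
\]
Each condition appears exactly once in every row and every column, so every method is used first, second and third equally often.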

For each method, participants had to complete three trials in succession.
Each of these trials started with a 30-second memorisation step of the target face.
After memorisation, participants were instructed to reconstruct the face using the respective method.
To prevent participants from memorising target faces across methods, no participant saw the same target face with more than one method; instead, each target face was paired with each method across different groups of participants.
After completing the three trials of one method, participants were asked to complete a System Usability Scale (SUS)~\citep{brooke1996sus} and a NASA-TLX questionnaire~\citep{hart2006nasa}.
Once participants finished all nine trials, they were briefly interviewed about the different systems and finally compensated for their participation.

\begin{table}[t]
\caption{Mean absolute feature distance (MAFD) for different facial regions of our gaze-based method compared to the baseline by \citet{zaltron2020cg} and the FaceMaker control condition.}
\label{tbl:results}
\begin{center}
\begin{tabular}{lccccc}
\toprule
 &  \multicolumn{5}{c}{MAFD}\\
\cmidrule{2-6}
Method & Eyes & Nose & Mouth & Jaw & Overall\\
\midrule
Ours & $\mathbf{46.3 \pm 8}$ & $\mathbf{44.5 \pm 16}$ & $\mathbf{41.8 \pm 12}$ & $\mathbf{45.7 \pm 14}$ & $\mathbf{40.9 \pm 6}$  \\
\citet{zaltron2020cg} & $55.8 \pm 13$ & $50.6 \pm 17$ & $54.9 \pm 15$ & $53.0 \pm 22$ & $48.0 \pm 7$  \\
\midrule
FaceMaker & $30.0 \pm 9$ & $33.5 \pm 11$ & $28.9 \pm 10$ & $33.6 \pm 11$ & $29.6 \pm 6$  \\
\bottomrule
\end{tabular}
\end{center}
\end{table}

\subsection{Reconstruction results}
\paragraph{Metrics.}
We report the mean absolute feature distance (MAFD) defined as
\[
\text{MAFD} = \frac{1}{28} \sum_{i=1}^{28} |f^p_i-f^M_i|, \tag{2} \label{eq:MAFD}
\]
where $f^p_i$ is the predicted value and $f^M_i$ is the target value of the $i$-th feature.
Feature values are normalised to the range from $1$ to $182$, which is the range given by FaceMaker.
Since MAFD measures the distance between predicted and target feature values, a lower MAFD is better.
In addition, we report performance for the facial areas \textit{Eyes}, \textit{Nose}, \textit{Mouth} and \textit{Jaw} by grouping features as defined by \citet{strohm21_iccv}.
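Two consequences of this definition are worth spelling out. First, because every feature value lies between $1$ and $182$, each absolute difference is at most $181$, so
\[
0 \leq \text{MAFD} \leq 181 .
\]
Second, a per-region score can be written analogously. We assume here, as a sketch rather than a definition taken from \citet{strohm21_iccv}, that the score of a region $R \subseteq \{1, \dots, 28\}$ averages over the features in that group only:
\[
\text{MAFD}_R = \frac{1}{|R|} \sum_{i \in R} |f^p_i - f^M_i| .
\]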

\paragraph{Results.}
Table~\ref{tbl:results} shows the performance of our system compared to the baseline and the FaceMaker control.
Our gaze-based method achieves an MAFD of $40.9$ compared to $48.0$ for the method of \citet{zaltron2020cg}, a relative improvement of about 15\%.
With the FaceMaker control condition, participants achieved an overall MAFD of $29.6$.
This error is considerable given that participants had full and explicit control of the face generation process, underlining the difficulty of the face reconstruction task.
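The relative improvement can be verified from the overall MAFD values in Table~\ref{tbl:results}:
\[
\frac{48.0 - 40.9}{48.0} = \frac{7.1}{48.0} \approx 0.148 \approx 15\% .
\]
By the same arithmetic, the FaceMaker control lowers the overall MAFD by $(40.9 - 29.6)/40.9 \approx 28\%$ relative to our method.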

Figure~\ref{fig:reconstructions} shows four sample target faces together with the reconstructions produced by our gaze-based method, the baseline method by Zaltron~et~al., and the FaceMaker control. Colour-coded feature groups indicate the reconstruction quality of the different facial regions.
The colour codes represent three equidistant bins between the minimum and maximum MAFD of the test set reconstructions.
Our method produces visually plausible mental image reconstructions for most facial regions.
Furthermore, as indicated by the MAFD, participants struggled to reconstruct some facial features even with full control in FaceMaker.
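The binning can be made explicit. Writing $d_{\min}$ and $d_{\max}$ for the minimum and maximum MAFD of the test set reconstructions (symbols introduced here for illustration), three equidistant bins have width $w = (d_{\max} - d_{\min})/3$ and edges
\[
d_{\min}, \quad d_{\min} + w, \quad d_{\min} + 2w, \quad d_{\max} .
\]
We assume that the lowest-error bin corresponds to high reconstruction quality and the highest-error bin to low quality, since a lower MAFD is better.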

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{figures/reconstructions.png}
    \caption{Example reconstructions created during the user study, showing our gaze-based collaborative system GBC-MIR (Ours) alongside the baseline and the FaceMaker control. Colour-coded labels (\textcolor{myGreen}{high}, \textcolor{myOrange}{medium}, or \textcolor{myRed}{low}) indicate the reconstruction quality of different facial regions. The colour codes represent three equidistant bins between the minimum and maximum MAFD of the test set reconstructions.}
    \label{fig:reconstructions}
\end{figure}

\subsection{Qualitative evaluation}
\label{sec:userstudy}

We conducted an additional 22-participant user study to assess the subjective quality of our reconstructions.
Participants were shown sets of four faces, each containing a target face from our test set as well as the reconstructions from the three methods in random order.
Participants were asked to rank the three reconstructions from most similar (rank one) to least similar (rank three) to the target.
We also asked them to provide written explanations of the reasoning behind their ranking.
The average rank given to our reconstructions was $2.28 \pm 0.7$, while the method by \citet{zaltron2020cg} received $2.40 \pm 0.7$ and the FaceMaker control $1.33 \pm 0.5$.
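As a consistency check, if every participant assigns the ranks $1$, $2$ and $3$ exactly once per set (that is, assuming no ties or missing ranks), the three mean ranks must average to $(1 + 2 + 3)/3 = 2$. The reported values satisfy this up to rounding:
\[
\frac{2.28 + 2.40 + 1.33}{3} = \frac{6.01}{3} \approx 2.00 .
\]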