\section{Gaze-based collaborative mental image reconstruction (GBC-MIR)}
\label{sec:method}

\begin{figure}[t]
    \centering
    \includegraphics[width=0.7\linewidth]{figures/overall_architecture.pdf}
    \caption{Overview of our method for gaze-based collaborative mental image reconstruction (GBC-MIR). With a specific mental image in mind, users observe auxiliary images proposed by a \textit{query engine} and visually reconstructed by a \textit{decoder}. A \textit{gaze-guided feature extractor} extracts image features that are relevant to the mental image. Based on these features, the query engine proposes image features for the next iteration. This loop continues for several iterations until all extracted feature vectors are combined to reconstruct the final mental image.}
    \label{fig:overall_architecture}
\end{figure}

The computational task that we study is that of visually reconstructing mental images from eye gaze fixations without prior knowledge of the target.
We approach this by showing multiple generated auxiliary images to a user and recording their gaze while they look at these images.
The auxiliary images are encoded and relevant image features are extracted based on the user's gaze behaviour.
The extracted features can subsequently be decoded to reconstruct the mental image.
\citet{strohm21_iccv} formulated this task as a mapping
$\{(I_i,G_i) \mid i=1,\dots,n\} \mapsto I_M$,
where $I_i$ are the auxiliary images shown to the user, $G_i$ are the eye gaze fixations of that user on these images, and $I_M$ is the to-be-reconstructed mental (target) image that the user has in mind.
The fundamental difficulty of this task is to extract sufficient information about the mental image from the fixations on the given set of auxiliary images.
In their work, a small set of six auxiliary images was generated using prior knowledge of the target.
This ensured that enough information about the target could be extracted from the gaze data.
However, information about the target image is not available in most real-world use cases (e.g., facial composites in law enforcement), which renders their method impractical for actual use.

To address this fundamental limitation, we present a collaborative human-AI system for mental image reconstruction that does not require any prior knowledge of the target image. 
Instead, our method iteratively reconstructs the target image by presenting sets of images to the user, thus performing a mapping
$\{(I_{i,j},G_{i,j}) \mid i=1,\dots,n\} \mapsto F_{M,j}$ for $j = 1,\dots,m$,
where $F_{M,j}$ represents the gaze-guided image features extracted in iteration $j$.
After $m$ iterations, in which our query engine dynamically selects image features based on the user's gaze, our system combines the extracted features of each round to generate the target image.
The overall architecture of our system is shown in Figure \ref{fig:overall_architecture} and its core components are described in the following.

\paragraph{Gaze-guided feature extractor (GFE).}
A central component of our collaborative system is the gaze-guided feature extractor (GFE).
It is represented by a pre-trained model that implements the function 
$\{(I_i,G_i) \mid i=1,\dots,n\} \mapsto F$,
i.e., given a set of auxiliary images $I$, it extracts a single feature vector $F$ based on the user's gaze fixations $G$. 
The extractor decides, for each feature dimension $f \in F$, from which auxiliary image $I_i$ to select a feature.
That is, $F$ is composed of features from the set of auxiliary images $I$.
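As a purely illustrative example (the numbers and the choice of $n=3$ and $|F|=4$ are assumptions made for exposition, not measured values), let $F_i$ denote the feature vector of auxiliary image $I_i$ and suppose
\[
F_1 = (0.2,\, 0.7,\, 0.1,\, 0.9), \quad F_2 = (0.5,\, 0.3,\, 0.8,\, 0.4), \quad F_3 = (0.6,\, 0.1,\, 0.2,\, 0.0).
\]
If the GFE selects images $2, 1, 3, 1$ for the four feature dimensions, the extracted vector is
\[
F = (0.5,\, 0.7,\, 0.2,\, 0.9).
\]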

\paragraph{Decoder.}
The decoder maps feature vectors $F$ into the image domain $I$ using a pre-trained model. This allows us to visualise features for the user (in the form of faces) and to collect gaze data, which in turn can be used to select relevant features through the GFE.
It is crucial that both the decoder and the GFE operate in the same feature space $\mathbb{F}$ to allow for an iterative loop of visually decoding features for users to look at, and encoding the images again using the recorded gaze information.

\paragraph{Query engine.}
\label{sec:query_engine}
For our system to maximise the information about the user's mental image, it needs to decide which images to show in each iteration.
For this, we propose a novel query engine which predicts the image features $F_{i,j+1}$ for the next iteration, given the previously shown and extracted features, $F_{i,j}$ and $F_{M,j}$:
\[
P(F_{i,j+1} \mid F_{i,j},F_{M,j}),\text{ with } F_{i,0} \text{ and } F_{M,0} = \text{constant} \tag{1} \label{eq:query_engine}
\]
This allows our system to adapt the images it shows to the user's previous gaze behaviour.
Thus, in each iteration, it can decide for which features to collect more information.

\paragraph{Iterative collaboration.}
Together, the GFE, decoder, and query engine form an interactive feedback loop, enabling the human and the system to collaboratively reconstruct the mental image as shown in Figure \ref{fig:overall_architecture}:
The query engine proposes a set of feature vectors about which our system wants to gain knowledge, which are visually decoded into the image space by the decoder.
Our system then shows these images to the user while their gaze fixations are being recorded.
Using the joint gaze and image information, the GFE then extracts the feature vector that is most relevant in this iteration for eventually reconstructing the mental image.
This cycle continues for $m$ iterations, after which all feature vectors extracted by the GFE are combined and visually decoded into the final mental image.

\begin{algorithm2e}
\caption{Selection layer simulating the gaze-guided feature extractor}
\label{alg:selection_layer}
\KwIn{Auxiliary and target image features $f_{I,j}$ and $f_M$. Probability distribution $P_\text{GFE}$}
\KwOut{Selected features $f_{M,j}$}
differences $\leftarrow$ abs($f_{I,j} - f_M$)\;
args $\leftarrow$ arg\_sort(differences)\;
$k \sim P_\text{GFE}$\;
selected\_arg $\leftarrow$ one\_hot(args[k])\;
$f_{M,j}$ $\leftarrow$ $(\text{selected\_arg} \cdot f_{I,j})$\;
\end{algorithm2e}

\subsection{Simulating the gaze-guided feature extractor}
\label{sec:simulating_gfe}

Training this system end-to-end is impractical given that human gaze data for the GFE would have to be collected dynamically during training for each set of images predicted by the query engine.
A key innovation of our method is that we instead simulate the GFE in the form of a selection layer during training, and only use a pre-trained GFE model, which requires gaze data, at test time.

The GFE is simulated as a probability distribution over the possible features that it can select from the auxiliary images.
Given the user's fixation data, the GFE predicts, for each feature dimension $f \in F$, from which of the shown auxiliary images $I$ to select the value of $f$.
Using a small amount of labelled data $\{(I_i,G_i,I_M) \mid i=1,\dots,n\}$, we can estimate how often the pre-trained GFE selects the feature value closest to the target, the second closest, and so on.
Specifically, we estimate $P_\text{GFE}(f_M=f_k)$, where $k$ represents the similarity rank of $f_k$ to the mental image feature $f_M$ (see Appendix~\ref{gfe_eval} for details).
Using these estimated probabilities we define a selection layer in Algorithm \ref{alg:selection_layer}.

Using the target feature value $f_M$, we calculate a distance ranking of the proposed feature values $f_{I,j} = (f_{1,j},f_{2,j},...,f_{n,j})$, $f_{i,j} \in F_{i,j}$ for iteration $j$.
The layer selects the $k$-th closest feature value from $f_{I,j}$ by sampling the rank $k$ from $P_\text{GFE}(f_M=f_k)$.
Instead of directly indexing the selected value from the auxiliary image, we construct a one-hot selection vector and calculate the dot product with the features of the auxiliary images $f_{I,j}$.
This way we model the selection as an external random node, allowing us to calculate gradients for each parameter in the neural network.
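To illustrate Algorithm~\ref{alg:selection_layer} for a single feature dimension, consider a hypothetical case with $n=3$ (all numbers are assumed for exposition, and indices are 1-based):
\begin{align*}
f_M &= 0.5, \qquad f_{I,j} = (0.9,\, 0.4,\, -0.2),\\
\text{differences} &= (0.4,\, 0.1,\, 0.7), \qquad \text{args} = (2,\, 1,\, 3).
\end{align*}
Sampling $k=1$ yields the one-hot vector $(0,1,0)$ and thus $f_{M,j} = (0,1,0)\cdot(0.9,\, 0.4,\, -0.2) = 0.4$, whereas $k=2$ yields $(1,0,0)$ and $f_{M,j} = 0.9$.
Treating the sampled one-hot vector as a constant, $\partial f_{M,j} / \partial f_{I,j}$ equals that one-hot vector, so the gradient reaches only the selected auxiliary feature value.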

The GFE performs the mapping $\{(I_i,G_i) \mid i=1,\dots,n\} \mapsto F$, while the selection layer implements the mapping $(\{F_i \mid i=1,\dots,n\}, F_M) \mapsto F$.
This approach allows us to predict the extracted features $F$ of each iteration without gaze data but requires the target image features $F_M$ as an input.
Since the target face and the corresponding features are known at training time, this proxy mapping allows us to train GBC-MIR end-to-end.

\subsection{End-to-end training}
\label{sec:neural_implementation}

\begin{figure}[t]
    \centering
    \includegraphics[width=0.8\linewidth]{figures/query_engine_architecture.pdf}
    \caption{Overview of our end-to-end neural implementation of the collaborative reconstruction system. Dense layers predict the auxiliary image features for the next iteration. For this, they use the input of two recurrent layers that combine information about the shown and selected features of previous iterations. A final dense output layer combines the information of all selected feature vectors to predict the mental image features.}
    \label{fig:neural_implementation}
\end{figure}

The only component that has to be learned is the query engine since pre-trained models are used for the GFE and decoder.
To train the query engine end-to-end we combine it with the selection layer defined in Algorithm \ref{alg:selection_layer} to create the GBC-MIR feedback loop.
Figure \ref{fig:neural_implementation} shows the neural network structure used to learn a set of recurrent and dense layers forming the query engine.
The network is composed of $m$ stacked iteration modules as indicated by the dotted red line in Figure \ref{fig:neural_implementation}.
Each module consists of a dense layer predicting $n$ auxiliary image feature vectors for that iteration based on the output of the previous module, which are fed into two separate network paths:
One path starts with a selection layer that simulates the pre-trained GFE and receives $n \cdot |F|$ features as input.
It outputs the selected features $F_{M,j}$ of iteration $j$ based on Algorithm~\ref{alg:selection_layer}. These features are input to a recurrent layer together with the history of selected features from all previous iterations, $\{F_{M,k} \mid k=j-1,j-2,\dots,1\}$, which combines the vector sequence into a single feature vector.
The second path starts with a dense projection layer, which also receives $n \cdot |F|$ features as input and outputs a lower-dimensional representation of these features.
Together with the history of projected features from previous iterations, these are input into another recurrent layer combining the features.
The projection path enables the network to keep track of which features it already proposed in any prior iteration.
The outputs of the two recurrent layers, one from the selection path and one from the projection path, are concatenated and input to the next module of iteration $j+1$.
After $m$ iteration modules, the output of the last module is fed into a dense output layer which predicts the target image features.
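The data flow through iteration module $j$ can be summarised in the following sketch. The operator names $\mathrm{Dense}_j$, $\mathrm{Select}$, $\mathrm{Proj}$, $\mathrm{RNN}_{\text{sel}}$, $\mathrm{RNN}_{\text{proj}}$, and $\mathrm{Dense}_{\text{out}}$ as well as the symbols $h_j$, $s_j$, $p_j$, and $\hat{F}_M$ are notation assumed here for illustration only, and nothing is implied about whether parameters are shared across modules:
\begin{align*}
(F_{1,j},\dots,F_{n,j}) &= \mathrm{Dense}_j(h_{j-1}), \\
F_{M,j} &= \mathrm{Select}\bigl(F_{1,j},\dots,F_{n,j},\, F_M\bigr), \\
s_j &= \mathrm{RNN}_{\text{sel}}\bigl(F_{M,1},\dots,F_{M,j}\bigr), \\
p_j &= \mathrm{RNN}_{\text{proj}}\bigl(\mathrm{Proj}(F_{1,1},\dots,F_{n,1}),\dots,\mathrm{Proj}(F_{1,j},\dots,F_{n,j})\bigr), \\
h_j &= [\,s_j \,;\, p_j\,],
\end{align*}
where $h_0$ is the constant network input, $\mathrm{Select}$ applies Algorithm~\ref{alg:selection_layer} to every feature dimension, and the predicted target features are $\hat{F}_M = \mathrm{Dense}_{\text{out}}(h_m)$.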

The network is trained by minimising the mean squared error between predicted and target image features.
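Under notation assumed here for illustration, with $\hat{F}_M$ denoting the output of the final dense layer and $d$ indexing the $|F|$ feature dimensions, the per-sample objective can be sketched as
\[
\mathcal{L} = \frac{1}{|F|} \sum_{d=1}^{|F|} \bigl(\hat{F}_M^{(d)} - F_M^{(d)}\bigr)^2 ,
\]
which would additionally be averaged over the training batch.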
While the network's input is only a constant value, it receives information about the target through the selection layer in each iteration module.
At test time the selection layer is replaced with a pre-trained GFE, allowing the system to make user-specific predictions based on their gaze behaviour. 
In this case a pre-trained decoder generates the auxiliary images for each iteration given the predicted image features of the query engine. 
These images can then be shown to the user while recording their gaze behaviour.
The generated images of the first iteration are constant, as our system has no knowledge about the mental image yet.
However, subsequent predictions are based on the features previously extracted by the GFE, which are user-specific and encode progressively more information about the user's mental image.