\section{Related work}

\paragraph{Mental image retrieval and reconstruction.}
In mental image retrieval,
a query engine typically presents selected, existing images to a user who then manually chooses the closest fit~\citep{fang2021attribute,fang2005experiments,fang2020perception,fang2018attribute,fang2005interactive}.
Based on this choice, the system then presents a new set of images and the user input process continues iteratively until the target image has been found. 
Instead of selecting existing images from a database, target images can also be created synthetically.
In mental image reconstruction, synthetic images are refined in every round of user feedback, until the user is satisfied with the result. \citet{bontrager2018deep} integrated user feedback in such a fashion and used deep interactive evolution to reconstruct mental images, while \citet{xu2019generating} improved their method by allowing control over specific facial features through relevance feedback.
By choosing only pictures with appropriate features, the user allows these features to contribute to the next iteration of pictures. \citet{zaltron2020cg} proposed Composite Generating Generative Adversarial Networks (CG-GAN) allowing the user to traverse the latent space of a pre-trained GAN.
Users were able to select proposed faces to combine their information through a process called mutation.
However, these methods rely on explicit user feedback rather than implicit behavioural cues.

\paragraph{Gaze-based mental image retrieval and reconstruction.}
Human gaze has recently attracted increasing research interest as a promising implicit feedback modality for mental image retrieval and reconstruction.
Leading these efforts was work by~\citet{sattar2015prediction}, who used bags of visual words to address the task in an open-world setting. \citet{stauden2018visual} further improved their feature extraction component by adding a pre-trained CNN, while \citet{barz2020visual} improved the encoding of fixation sequences with a pre-trained SegNet and used an SVM for the final prediction task.
%Two important projects also modified the task itself.
In later work, \citet{sattar2020deep} went beyond predicting only the target instance by also predicting the target class and attributes, while \citet{wang2019mental} explored a setting in which gaze behaviour used for the target prediction was collected after showing a stimulus. 
Finally, \citet{strohm21_iccv} were the first to demonstrate the feasibility of reconstructing mental images using gaze behaviour only.
Their method consisted of an encoder to extract image features and activation maps as well as a scoring network to compare human gaze maps with neural activation maps to predict a relevancy score for each extracted feature.
These scores were used to combine features from multiple images into a single feature vector that was finally decoded into the mental image.
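As an illustrative sketch only (the notation is an assumption made here and is not taken from \citet{strohm21_iccv}), let $\mathbf{f}_{i,k}$ denote the $k$-th feature extracted from image $i \in \{1,\dots,N\}$ and let $s_{i,k} \geq 0$ be its predicted relevancy score. A simple relevance-weighted combination of the $k$-th feature across images could then be written as
\[
  \mathbf{z}_k \;=\; \frac{\sum_{i=1}^{N} s_{i,k}\,\mathbf{f}_{i,k}}{\sum_{j=1}^{N} s_{j,k}},
\]
assuming $\sum_{j=1}^{N} s_{j,k} > 0$; the combined vector passed to the decoder would stack the $\mathbf{z}_k$, and the combination rule actually used in the original work may differ.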
While achieving promising results, their method requires prior knowledge, which severely limits its usability.
