\section{Introduction}

Methods and tools to generate human faces have broad applicability in computer graphics, visual design, human-computer interaction (HCI), and beyond. 
For example, creating facial composites or photofits -- visual reconstructions of a human face from someone's mind -- is widely used in criminal investigations to reconstruct the appearance of a wanted person~\citep{frowd2004evofit,frowd2005forensically,george2008efit}.
Another popular application domain is gaming, in which users frequently want to generate digital avatars that visually resemble them as quickly and effortlessly as possible.
Previous work has explored two main approaches to achieve this goal:
The first approach automatically converts a real face image into an avatar using image processing techniques~\citep{kim2019u,hu2017avatar}.
However, these methods require an image of the target face to be available in advance, which is not always possible, in particular for facial composites or photofits.
The second and more common approach is to rely on software tools that, while allowing for fine-grained control of the generated images~\citep{schwind2017facemaker}, require a lot of tedious and time-consuming manual work, and are usually geared to a particular application domain. 

A promising alternative is offered by computational methods that directly reconstruct mental images
from brain activity recorded while users look at carefully crafted visual stimuli
\citep{beliy2019voxels,guccluturk2017reconstructing,shen2019deep,vanrullen2019reconstructing,date2019deep,shatek2019decoding,lin2019dcnn}. 
However, measuring brain activity is impractical for most real-world systems given that
it requires expensive and special-purpose equipment as well as extensive operator training.
In contrast, human gaze can be measured using off-the-shelf hardware that has recently become both significantly more affordable and usable by non-experts~\citep{kassner2014pupil}.
Given these advances, a recent line of work has started to investigate gaze-based mental image retrieval~\citep{wang2019mental} as well as visual
reconstruction~\citep{sattar2017predicting,sattar2020deep}.
While the former uses eye gaze features to retrieve a mental image from an existing database, gaze-based mental image reconstruction is considerably more challenging given that the target image exists only in the user's mind.

\citet{strohm21_iccv} recently demonstrated the first method to reconstruct a mental image solely from eye gaze fixations.
While their face reconstructions looked promising visually, their method still required the target face to be known in advance, which renders the method impractical for real applications. 
To address this fundamental limitation, we propose a novel method for gaze-based mental image reconstruction that does not require any prior knowledge about the target image.
To achieve this goal, our iterative method implements the principle of human-AI collaboration~\citep{akata2020research} (see Figure~\ref{fig:overall_architecture}).
On the AI side, we propose a query engine that predicts the most relevant set of face image features for the next iteration while taking prior iterations into account.
These features are then decoded into images using a decoder network and presented to the human.
The human user looks at these images, searching for features similar to those of their mental image, while their gaze behaviour is recorded using an eye tracker.
Subsequently, new face image features are extracted using the gaze-guided extractor proposed by \citet{strohm21_iccv}.
These steps are repeated multiple times, with gaze-guided image features being extracted in each iteration.
Finally, all extracted image features are combined into a single feature vector that is decoded into the final facial composite.

The contributions of our work are twofold:
First, we propose a novel gaze-based collaborative method for mental image reconstruction (GBC-MIR).
In stark contrast to prior work, our method does not require prior knowledge about the target face but only a pre-trained gaze-guided image feature extractor. 
Furthermore, our method is domain-independent -- while we demonstrate an application for face image reconstruction, our method can be applied to any mental image reconstruction task.
Second, we evaluate our method in a 12-participant user study and show that it outperforms the current state of the art in reconstruction quality while, at the same time, achieving higher usability.
