\documentclass[pmlr]{jmlr}% new name PMLR (Proceedings of Machine Learning Research)

 % The following packages will be automatically loaded:
 % amsmath, amssymb, natbib, graphicx, url, algorithm2e

 %\usepackage{rotating}% for sideways figures and tables
\usepackage{longtable}% for long tables

 % The booktabs package is used by this sample document
 % (it provides \toprule, \midrule and \bottomrule).
 % Remove the next line if you don't require it.
\usepackage{booktabs}
 % The siunitx package is used by this sample document
 % to align numbers in a column by their decimal point.
 % Remove the next line if you don't require it.
% \usepackage[load-configurations=version-1]{siunitx} % newer version
 %\usepackage{siunitx}

 % The following command is just for this sample document:
\newcommand{\cs}[1]{\texttt{\char`\\#1}}

 % Define an unnumbered theorem just for this sample document:
\theorembodyfont{\upshape}
\theoremheaderfont{\scshape}
\theorempostheader{:}
\theoremsep{\newline}
\newtheorem*{note}{Note}

 % change the arguments, as appropriate, in the following:
\jmlrvolume{23}
\jmlryear{2022}
\jmlrworkshop{NeurIPS 2022 Gaze Meets ML Workshop}

\title[Modeling Eye Movements in a Maze-Solving Task]{Modeling Human Eye Movements with Neural Networks in a Maze-Solving Task}


\author{
  \Name{Jason Li} \Email{jasli@mit.edu}\\
  \Name{Nicholas Watters} \Email{nwatters@mit.edu}\\
  \Name{Yingting Wang} \Email{swang22@bu.edu}\\
  \Name{Hansem Sohn} \Email{hansem@mit.edu}\\
  \Name{Mehrdad Jazayeri} \Email{mjaz@mit.edu}\\
  \addr Department of Brain and Cognitive Sciences,\\
  McGovern Institute for Brain Research\\
  Massachusetts Institute of Technology \\
  Cambridge, MA 02139
}



% \editor{Editor's name}
 % \editors{List of editors' names}

\begin{document}

\maketitle

\begin{abstract}
From smoothly pursuing moving objects to rapidly shifting gazes during visual search, humans employ a wide variety of eye movement strategies in different contexts.
While eye movements provide a rich window into mental processes, building generative models of eye movements is notoriously difficult, and to date the computational objectives guiding eye movements remain largely a mystery.
In this work, we tackled these problems in the context of a canonical spatial planning task, maze-solving.
We collected eye movement data from human subjects and built deep generative models of eye movements using a novel differentiable architecture for gaze fixations and gaze shifts.
We found that human eye movements are best predicted by a model that is optimized not to perform the task as efficiently as possible but instead to run an internal simulation of an object traversing the maze.
This not only provides a generative model of eye movements in this task but also suggests a computational theory for how humans solve the task, namely that humans use mental simulation.
\end{abstract}
\begin{keywords}
recurrent neural network, saccade, mental simulation, psychophysics, maze, gaze, fovea
\end{keywords}

\section{Introduction}

Throughout the history of cognitive science, eye movements have been appreciated as a window into the workings of the mind and brain \citep{Helmholtz1924, Liversedge2000,Hayhoe2005-jy,Konig2016-za}.
However, human eye movements are so rich and varied that characterizing them is difficult even in simple tasks \citep{Land2000-wn, Beller2022, Gerstenberg2017}.
Building generative models of eye movements is an even greater challenge \citep{chen2017eccentricity, zoran2020towards}, and to date most such work focuses only on free-viewing or visual search contexts, not complex cognitive tasks \citep{kummerer_2021, Zelinsky2020-iz}.

To tackle the problem of modeling task-driven saccade sequences, we designed a maze-solving task.
In this task, subjects must find the exit location of a path in a maze given a starting point of the path (Figure \ref{task}). This task provides an ideal platform for building generative models of eye movements because it offers a near-limitless variety of spatial plans, yet eye movements are largely consistent across humans \citep{crowe2000mental}, making them tractable to model.
Furthermore, this task may be solved using mental simulation of an object traveling through the maze, so it allows us to test mental simulation as a computational theory guiding eye movements
\citep{Gerstenberg2017, ullman2017mind, ahuja2019behavioral, Rajalingham2021}.

In this work, we develop a novel general-purpose method for incorporating features of human vision such as eccentricity-dependent visual acuity and discrete saccades into a task-optimized, end-to-end differentiable recurrent network. Using this method, we construct a space of models with and without mental simulation constraints, and train these on the maze-solving task.
We collect eye movement data from human subjects playing the task and compare this data to eye movements generated by the models to test multiple hypotheses for how humans solve the task.


\section{Related Work}

Building generative models of human eye movements has been an active area of research in psychology for decades \citep{Zelinsky2020-iz, kummerer_2021, wedel2022}.
One approach to modeling eye movements is to hard-code the heuristics of eye movements without employing task-driven learning.
This approach has seen some success in free viewing or visual search contexts \citep{itti1998, zelinsky2008, zhang2005, adeli2017, zelinsky2013, Eckstein2011-po}.
However, our model differs from those approaches in that (i) it learns a general policy for generating saccadic eye movements, so it can in principle be applied to any task, and (ii) it is a neural network, and hence can more easily serve as a mechanistic model of the brain at an implementation level.

More recently, deep learning approaches to generate sequences of eye movements have been developed, for example, by fitting a model directly to human data \citep{assens2017, sun2019, xia2019, yang2020, kummerer_2022}.
While these can provide impressive fits to human data, in this work our goal differs in that we aim to build a model that shows emergent human-like eye movements through task optimization, without explicitly fitting to human eye movement data.
Other deep learning models employ sequential attention in general-purpose task-trained networks, but have not been hypothesized as models of human eye movements or tested against human eye data \citep{gregor2015, eslami_2016, Adeli2022}.
In contrast, we develop our models to generate human-like eye movements, collect (and open-source) human eye data, and test our models against this data.




\section{Methods}\label{section:methods}

All of our data and code can be found at \href{https://github.com/jazlab/Maze_Task_2022}{https://github.com/jazlab/Maze\_Task\_2022}, along with documentation and instructions for replicating our results.

\subsection{Task and maze dataset}

In the maze-solving task, a subject is presented with a square maze and an entrance point somewhere on its perimeter.
This entrance point is one end of an unbroken, non-branching path through the maze, which exits the maze at some other uniformly sampled perimeter point.
The subject's task is to find this exit point (see Appendix \ref{section:appendix_task_instruction} for task instructions for human subjects).
We trained models on random mazes generated online.
We also created a test set for human and model comparison comprising 200 unique procedurally generated mazes.
See Appendix \ref{section:appendix_maze_generation} for details about the maze generation algorithm.

\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.9\linewidth]{task.PNG}
    \caption{Screenshots of the task presented to human subjects. At the start of a trial, the subject fixates on a maze entrance point indicated by a white cross (before maze presentation) or a green dot (after maze presentation).
    Subjects must locate the correct exit point and press a button once they have fixated their gaze on the exit point. At the button press, the true exit position is indicated by another green dot (lower-left side in ``Correct'') and a red dot (``Gaze'') shows the reported position. The rightmost panel (``Error'') illustrates an error trial in which the subject incorrectly identified the exit.}
    \label{task}
\end{figure}

\subsection{Human data collection}\label{section:human_data_collection}

Fourteen human subjects volunteered to participate in the experiments after providing informed consent. All participants (age: 18 to 65 years old, eight female and six male) had normal or corrected-to-normal vision with no history of neurological or psychiatric disorders. All experiments were approved by the Committee on the Use of Humans as Experimental Subjects at the Massachusetts Institute of Technology.

Subjects were seated in front of an LCD monitor (width: 53 cm, height: 30 cm; Acer R240HY) at a distance of 66 cm. Each session started with a procedure for calibrating eye positions using an optical eye tracker (EyeLink 1000 Plus, SR Research). Eye position was sampled monocularly at 1 ms resolution while a chin rest stabilized the participant's head. We monitored the quality of the eye signal during the experiment and repeated the eye calibration in the middle of the session if needed. After subjects received instructions for the maze-solving task and completed several practice trials, data collection began. In each trial, we randomly selected a maze from a predetermined test set. The test set included two repetitions of each unique maze. Each participant completed one 1-hour session, which consisted of approximately 400 trials. Maze stimuli (14 degrees of visual angle) and behavioral contingencies were controlled by open-source software (MWorks; mworks-project.org/) and the Modular Object-Oriented Games (MOOG) library \citep{moog2021}.
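
As a rough check on the stimulus geometry, and assuming a flat screen with the maze centered on the line of sight (an assumption not stated above), a stimulus subtending $\theta = 14$ degrees at a viewing distance of $D = 66$ cm has a physical width of approximately
$$w = 2 D \tan(\theta / 2) = 2 \cdot 66\,\text{cm} \cdot \tan(7^\circ) \approx 16.2\,\text{cm},$$
which fits within the 30 cm height of the monitor.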

While solving the maze, humans' eye movements were exclusively saccadic. Consequently, we extracted saccades from the eye position data by first filtering with a 4 ms Gaussian kernel and then thresholding eye velocity at 50 degrees of visual angle per second. To prevent measurement noise in the eye position from dominating the metrics that we use to compare humans with models, we recalibrated the raw eye position data for each trial. To do so, we estimated a calibration error vector between the actual fixation point and the recorded gaze fixation point, and subtracted that vector from the eye positions throughout the trial. This recalibration did not affect our main findings.
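
One way to write this procedure explicitly is sketched here; the symbols $x_t$, $\tilde{x}_t$, $v_t$, $\Delta t$, and $\epsilon$ are introduced only for illustration, and the finite-difference velocity estimate is an assumption, so the released implementation may differ in detail. With smoothed eye position $\tilde{x}_t$ (in degrees of visual angle) sampled every $\Delta t = 1$ ms, a sample is labeled as part of a saccade when
$$v_t = \frac{\|\tilde{x}_t - \tilde{x}_{t-1}\|_2}{\Delta t} > 50~\text{deg/s},$$
that is, when the smoothed gaze moves more than $50 \cdot 0.001 = 0.05$ degrees between consecutive samples. The per-trial recalibration then corresponds to
$$x_t \mapsto x_t - \epsilon, \qquad \epsilon = x^{\text{gaze}}_{\text{fix}} - x^{\text{actual}}_{\text{fix}},$$
where $x^{\text{actual}}_{\text{fix}}$ is the actual fixation point and $x^{\text{gaze}}_{\text{fix}}$ is the gaze position recorded while the subject fixates it.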

\subsection{Gaze recurrent neural network (RNN) models}\label{section:gazeRNNs}

We developed a task-optimized recurrent convolutional neural network model that is equipped with a foveal module. The foveal module allows the model to receive high acuity visual information near the fovea and low acuity information in the periphery, like the human eye.
The recurrent model is also able to control the position of its fovea, allowing it to make eye movements.
The model is end-to-end differentiable, so it can be trained via backpropagation, through which an eye movement policy emerges from task optimization.

We modeled the fovea by applying a circular exponential mask $e^{-d/\tau}$ to the visual input, where $d$ is the distance to the center of the fovea and $\tau$ is a scaling parameter.
We set $\tau$ to 5 pixels (in a maze 39 pixels wide), which is consistent with reported human peripheral visibility maps \citep{Najemnik2005-he,Strasburger2011-bv}. See Appendix \ref{section:tau_sweeping} for results with varying choices of $\tau$.
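
For intuition, and assuming the mask multiplies pixel intensities, the attenuation with $\tau = 5$ at a few distances from the center of the fovea is (rounded to three decimal places)
$$e^{-0/5} = 1, \qquad e^{-5/5} \approx 0.368, \qquad e^{-10/5} \approx 0.135, \qquad e^{-20/5} \approx 0.018.$$
Thus a pixel one $\tau$ from the center of the fovea retains about 37\% of its intensity, while a pixel 20 pixels away (roughly half the maze width) retains under 2\%.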

After applying the mask, we add independent noise to each pixel, sampled from a normal distribution $\mathcal{N}(\mu = 0,\,\sigma^2 = 0.05)$.
This noise washes out faint information in the peripheral tail of the foveal mask, analogous to the decreased peripheral photoreceptor density in the human retina. Note that this noise is essential to prevent the network from exploiting the peripheral information available when only the mask is used. Figure \ref{fovea} shows a diagram of the model's fovea mechanism. This foveal module is general-purpose, and in theory can be incorporated into any RNN that takes visual input.
To our knowledge, this method is novel in the field.
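
A rough calculation illustrates the scale at which the noise dominates, assuming the mask multiplies pixel intensities that are at most 1 (an assumption about the input normalization) and taking $\tau = 5$. The noise has standard deviation $\sigma = \sqrt{0.05} \approx 0.224$, and the masked signal falls to this level at distance
$$d^{*} = \tau \ln(1/\sigma) \approx 5 \cdot \ln(1/0.224) \approx 7.5 \text{ pixels}.$$
Beyond roughly this radius, a single masked pixel is smaller than a typical noise sample, which is the sense in which faint peripheral information is washed out.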

\begin{figure}[!ht]
\floatconts
    {fig:subfigex2}
    {\caption{Gaze RNN model diagram.}}
    {
        \subfigure[Differentiable fovea mechanism. Input is eye position and rendered maze; output is noisy masked maze, the visual input for the next step of the gaze RNN.][b]{
            \label{fovea}
            \includegraphics[width=\textwidth]{fovea.PNG}
        }
        \subfigure[Gaze RNN model. Memory CNN is a 3-layer Convolutional Neural Network. Saccade CNN is a strided 3-layer CNN with two 3-layer MLP (Multi-Layer Perceptron) heads for Cartesian eye position and ball position vectors. This architecture can be unrolled through time for an arbitrary number of steps or saccades.][b]{
            \label{model}
            \includegraphics[width=\textwidth]{model.PNG}
        }
    }
\end{figure}

We implement three specific convolutional RNN models, \textsc{exit}, \textsc{sim}, and \textsc{hybrid}. They all receive visual input via this foveal module, have the same internal architecture, and can generate two outputs: the Cartesian coordinates of the eye position (i.e., the center of the fovea) and the Cartesian coordinates of the next ball position (Figure \ref{model}). The three models differ only in their objective function.


\subsubsection{Exit}

The \textsc{exit} model is trained with a Mean Squared Error (MSE) loss between the eye position at each step ($\hat{p}_i^\text{eye}$) and the maze exit point ($p^\text{exit}$), across all $n$ steps.
There is no loss on the model's ball position output.
This model represents an optimal exit-finding strategy where the model moves its eyes to the exit in as few saccades as possible. 
We minimize

$$L_{\text{\sc{exit}}} = \frac1n \sum_{i=1}^{n} (\hat{p}_i^\text{eye} - p^\text{exit})^2$$

\subsubsection{Simulation}

The \textsc{sim} model aims to predict the position of an imaginary ball that moves from start to exit at constant velocity. To do so, we optimize this model with a ``simulation loss,'' which is formulated as the MSE between each predicted ball position ($\hat{p}_i^\text{ball}$) and the actual position of an imaginary ball traveling at 10 pixels per time step ($p_i^{\text{ball}}$). This ball speed was calculated from the human eye movement data as $l/n$ averaged over all trials, where $l$ is the total length of the maze path and $n$ is the number of saccades.
In the \textsc{sim} model, eye position is not explicitly constrained, but still plays a critical role in advancing the model's visual field. 
We minimize

$$L_{\text{\sc{sim}}} = \frac1n \sum_{i=1}^{n} (\hat{p}_i^\text{ball} - p_i^{\text{ball}})^2$$
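
To make the target $p_i^{\text{ball}}$ concrete, one possible formalization is given here; the arc-length parameterization $\gamma$ and the clamping of the ball at the exit are assumptions made for illustration:
$$p_i^{\text{ball}} = \gamma\big(\min(v \cdot i,\; l)\big), \qquad v = 10 \text{ pixels per step},$$
where $\gamma(s)$ is the point at path length $s$ from the maze entrance and $l$ is the total path length. Under this reading, after $i = 3$ steps the ball has traveled 30 pixels along the path, and over 8 recurrent steps it can cover at most $8 \cdot 10 = 80$ pixels of path.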

\subsubsection{Hybrid}

The \textsc{hybrid} model is trained with a weighted sum of the loss functions for the \textsc{exit} and \textsc{sim} models. 
The model's eye position must reach the exit quickly \textit{and} allow the model's predicted ball positions to match the position of the imaginary ball in the maze. 
The ratio of \textsc{sim} to \textsc{exit} loss weight is controlled by a coefficient $\beta = \frac13$, chosen so that the two loss terms have similar magnitudes in a fully trained model. 
We minimize

$$L_{\text{\sc{hybrid}}} = \beta \cdot L_{\text{\sc{exit}}} + (1-\beta) \cdot L_{\text{\sc{sim}}}$$
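
Substituting $\beta = \frac13$ gives
$$L_{\text{\sc{hybrid}}} = \tfrac13 \, L_{\text{\sc{exit}}} + \tfrac23 \, L_{\text{\sc{sim}}},$$
so the simulation term is weighted twice as heavily as the exit term. As a purely illustrative example with made-up values $L_{\text{\sc{exit}}} = 6$ and $L_{\text{\sc{sim}}} = 3$, each term contributes $2$ and $L_{\text{\sc{hybrid}}} = 4$.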

All models were trained on an NVIDIA GeForce GTX 1080 Ti GPU with 4 GB RAM per model and a total compute time of about 100 hours.
Given computational resource limitations, we trained one instance of each model, though see Appendix \ref{section:tau_sweeping} for results from additional instances in the context of hyperparameter sweeps.
Models were implemented in PyTorch \citep{pytorch_2019} and trained with 8 recurrent steps per maze and a batch size of 16 for 1.8 million iterations, using the Adam optimizer \citep{kingma2014adam} with learning rate 0.0003.
This was sufficient for each model's loss to converge to a stable asymptote. 
The training dataset was generated online using a custom procedural maze generator with the same statistics (though not the same samples) as the test set (see Appendix \ref{section:appendix_maze_generation} for maze-generation details).
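
As a back-of-the-envelope estimate of the amount of training data, assuming each iteration draws a fresh batch of newly generated mazes, a training run sees
$$1.8 \times 10^{6} \text{ iterations} \times 16 \text{ mazes per iteration} = 2.88 \times 10^{7} \text{ mazes},$$
or about $2.88 \times 10^{7} \times 8 \approx 2.3 \times 10^{8}$ recurrent steps.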

\subsection{Baseline model}

As a standard of comparison for the gaze RNNs, we created a baseline model designed to match high-level human saccade statistics. 
This model iteratively constructs saccade paths where the amplitude and angle of each saccade, as well as the total number of saccades in the path, are sampled from the corresponding distributions found in our human eye movement data (Figure \ref{saccades}). 
For each trial, we construct 2,000 saccade paths and select the path whose final fixation point is closest to the correct maze exit. 
For our maze test set, this is sufficient to ensure that the final fixation point falls within 5 pixels of the maze exit in more than 90\% of trials.
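
In symbols, a sketch of this procedure (with notation introduced here only for illustration) is as follows. Each candidate path $k \in \{1, \dots, 2000\}$ with $m_k$ saccades ends at
$$q_k = p^{\text{start}} + \sum_{j=1}^{m_k} a_{k,j} \begin{pmatrix} \cos\theta_{k,j} \\ \sin\theta_{k,j} \end{pmatrix},$$
where the saccade count $m_k$, amplitudes $a_{k,j}$, and angles $\theta_{k,j}$ are drawn from the empirical human distributions, and the selected path is $k^{*} = \arg\min_k \|q_k - p^{\text{exit}}\|_2$. This sketch assumes that every path starts at the maze entrance $p^{\text{start}}$ and that amplitudes and angles are sampled independently; the description above does not specify either point.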

\subsection{Metrics for comparing eye movement data}\label{section:metric}

To quantify the similarity between eye movement paths, we use two metrics:
\begin{itemize}
    \item \textbf{Nearest neighbors distance} is computed as the average of two terms: the mean distance from each point $p_A$ in path $A$ to its nearest point in path $B$, and the mean distance from each point $p_B$ in path $B$ to its nearest point in path $A$:
    \begin{align}
        \mathcal{L}_{NN} &= \frac{1}{2} \cdot \left(
            \mathbb{E}_{p_A \in A}\left[\min_{p_B \in B} ||p_B - p_A||_2\right] + 
            \mathbb{E}_{p_B \in B}\left[\min_{p_A \in A} ||p_A - p_B||_2\right]
        \right)\notag
    \end{align}
    \item \textbf{Area between paths} is computed as the total area of the polygon(s) formed between paths $A$ and $B$. See Appendix \ref{section:appendix_area_between_paths} for details.
\end{itemize}

For both of these metrics, a lower value implies the paths $A$ and $B$ are more similar.
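
As a small worked example of the nearest neighbors distance (the two toy paths are hypothetical and are treated as sets of fixation points), let $A = \{(0,0), (2,0)\}$ and $B = \{(0,1), (1,1), (2,1)\}$. Each point of $A$ lies at distance $1$ from its nearest point in $B$, so the first expectation is $1$. The points of $B$ lie at distances $1$, $\sqrt{2}$, and $1$ from their nearest points in $A$, so the second expectation is $(2 + \sqrt{2})/3 \approx 1.138$. Hence
$$\mathcal{L}_{NN} \approx \tfrac12 \left( 1 + 1.138 \right) \approx 1.069.$$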

\section{Results}

\subsection{Saccade path similarity}

Figure \ref{examples} shows the behavior of two representative human subjects, three RNN models, and the baseline model for three example mazes in the test set. Evidently, the saccadic eye movements in humans and models roughly follow the correct path through the maze and successfully find the exit point.
Humans display a tendency to cut corners of the maze path. 
Qualitatively, out of the gaze RNNs, the \textsc{exit} model seems most dissimilar to humans as it often makes large saccades that are not present in human eye movements. 
The \textsc{sim} and \textsc{hybrid} models make more uniform saccades that appear to better match human saccades. 
The baseline model tends to generate erratic saccade paths that, by construction, terminate near the exit point but do not resemble human saccade paths.

\begin{figure}[!ht]
    \begin{center}
        \includegraphics[width=0.9\linewidth]{color_examples.PNG}
    \end{center}
    \caption{Human and model behaviors on three sample mazes.} 
    \label{examples}
\end{figure}

\begin{figure}[!ht]
    \begin{center}
        \includegraphics[width=0.8\linewidth]{metrics.PNG}
    \end{center}
    \caption{
    Metric scores between model and human eye paths computed on the test set.
    To compute these given a model and metric, for each test maze we compute the metric score on each [model gaze path, subject gaze path] pair using all human subject trials and 2 runs of a trained instance of the model on the test maze.
    We then average all of these scores to obtain a total model-human similarity.
    Error bars are 95\% confidence intervals.
    Quantitatively, nearest-neighbors and area-between-paths scores are: \textsc{exit}: [$0.987 \pm 0.005$, $1.135 \pm 0.012$]; \textsc{sim}: [$0.747 \pm 0.007$, $0.679 \pm 0.012$]; \textsc{hybrid}: [$0.848 \pm 0.006$, $0.811 \pm 0.013$]; between-human mean: [$0.762$, $0.750$].
    Note that the \textsc{sim} model achieves better average similarity to humans than the between-human similarity. This is possible and implies that the model has lower variance than the inter-subject variance.}
    \label{metrics}
\end{figure}

Quantitatively, the \textsc{sim} model exhibits the lowest mean model-human distances under both distance metrics (Figure \ref{metrics}). 
On the other hand, the \textsc{exit} model produces the least human-like eye movement paths, comparable to those produced by the baseline model. 
The \textsc{hybrid} model's metric scores fall between those of the other two gaze RNN models. 
Therefore, the \textsc{sim} model is the most human-like of our three generative models. 

\subsection{Saccade vector similarity}

\begin{figure}[!ht]
    \begin{center}
        \includegraphics[width=\linewidth]{saccade_dists.png}
    \end{center}
    \caption{
    Saccade vector distributions for humans and models in visual angle.
    For models, each plot shows 1,000 randomly sampled saccade vectors from a dataset of 2 model evaluations on each test maze.
    The human plot shows 1,000 randomly sampled saccades from all human data aggregated across subjects.
    }
    \label{saccades}
\end{figure}

In addition to comparing human and model saccade paths on a trial-by-trial basis, we also compared their aggregate saccade vector distributions. 
Figure \ref{saccades} shows a sample of the saccades executed by humans and each model on the test set, with the tail of every saccade vector centered and the head plotted as a point. 
In the human distribution, saccade angle is relatively uniform and most saccade amplitudes are contained within a radius of 3 degrees. 
This is also true for the baseline model's distribution, which was sampled from the human distribution. 
Consistent with the sample shown in Figure \ref{examples}, the \textsc{exit} model's distribution shows many high-amplitude saccades, reflecting the large, erratic saccades that model tends to favor. 
The \textsc{sim} model's distribution is nearly bounded by a square, which results from the right-angle maze geometry. 
Saccade amplitude attains its maximum at the four cardinal directions because it is in those directions that a constant-velocity ball can travel farthest through the maze.
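
A short calculation makes this geometry explicit, under the simplifying assumptions (made here for illustration) that each saccade tracks a ball that travels a path length $v$ per step and that the path makes at most one right-angle turn within a step. If the ball travels a distance $a$ before the turn and $v - a$ after it, its displacement is $(a,\, v - a)$ up to rotation and reflection, with amplitude
$$\sqrt{a^2 + (v - a)^2}, \qquad 0 \le a \le v.$$
This amplitude equals $v$ when $a = 0$ or $a = v$ (a straight segment along a cardinal direction) and reaches its minimum $v/\sqrt{2}$ at $a = v/2$ (a diagonal displacement). For $v = 10$ pixels the minimum is about $7.07$ pixels. These displacements satisfy $|\Delta x| + |\Delta y| = v$, which describes a square whose corners lie on the cardinal axes.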

These results suggest that although the \textsc{sim} model's eye movement paths most closely resemble those of humans, it is not a complete model of humans' eye movement strategy when solving this task. 


\section{Limitations and Future Work}\label{section:limitations}

One limitation of our work is that the simulation model shows a more constrained saccade vector distribution than humans (Figure \ref{saccades}), which may be a consequence of its tendency to follow the path more faithfully than humans (Figure \ref{examples}).
This close path-following with its fovea is an emergent property of the model: The simulation loss was applied only to the model's ``ball position'' output, not its ``eye position'' output.
Nonetheless, similarity to human eye movements may be improved with variants of the \textsc{sim} model, such as by (i) varying the speed of the ball simulation, or (ii) introducing a non-constant simulation speed, either learned by the model or computed based on predetermined heuristics (such as speeding up on long straightaways and slowing down near corners).
Furthermore, exploration of the impact of simulation speed on eye movements in the model may lead to predictions about simulation speed in humans.
Future work is needed to explore these possibilities.

A second limitation of our work is that 
it is difficult to characterize the biological plausibility of our foveal module's hyperparameters. 
For instance, human visibility varies greatly with context-dependent factors such as scene clutter, crowding, and luminance \citep{pelli2007crowding, Levi2008-tw,Strasburger2011-bv}, so it is difficult to ascertain how our scaling parameter $\tau$ compares to that of the human fovea.
Nevertheless, we believe that our choice of hyperparameters falls within reasonable bounds, and when we varied the hyperparameter $\tau$, the simulation model remained the most similar to the human data (Appendix \ref{section:tau_sweeping}).

A third limitation of our models is that they make one saccade for each RNN timestep.
This prevents them from capturing temporal aspects of human eye movements, such as the duration of fixations between saccades, reaction times, etc.
Future work may address this by allowing the model to control fixation durations, which could emerge from task training if longer fixations reduce perceptual noise.

Finally, the mazes used to train and test the gaze RNN models had paths and walls of the same width, while those presented to human subjects playing the task had much thinner walls than paths. 
The resolution of the gaze RNN mazes was limited by computational capacity.
This may have resulted in humans producing slightly different eye movement trajectories. 

\section{Conclusion}

We find that in the maze-solving task, a gaze RNN trained to run an internal simulation of a ball moving through a maze generates eye movements more similar to those of human subjects than a model trained only to solve the task as optimally as possible.
This suggests that humans may employ a similar mental simulation when performing this maze-solving task.
Further work is needed to explore the relationship between the biological plausibility of the model fovea hyperparameters and model behavior.
Future work also includes incorporating our differentiable fovea method into RNNs trained on other tasks to study the principles of human eye movements in domains beyond maze-solving.

\acks{
J.L. is supported by the MIT Quest for Intelligence.
N.W. is supported by the National Science Foundation.
Y.W. is supported by the Simons Foundation.
H.S. is supported by a NARSAD young investigator grant from the Brain \& Behavior Research Foundation.
M.J. is supported by the Simons Foundation, the McKnight Foundation, and the McGovern Institute.
}

\bibliography{bibliography}

\appendix

\section{Task instructions for human subjects}\label{section:appendix_task_instruction}

The goal of this ``maze-solving'' task is to identify the exit of the maze given an initial entry position (indicated by a green ball). At the beginning of a trial, you will see a white cross, which you are asked to look at. The position of the cross corresponds to the initial entry position.

After a random delay, the maze and the green ball will appear on the screen. From the entry where the green ball sits in, there will be one continuous path until it hits one of the 4 boundaries of the maze.

After you identify the exit, fixate your eye gaze at the exit and then press the left-arrow key to report the exit location (i.e., where you are looking when pressing the key). As a feedback, the green ball will be revealed at the correct location where it should exit the maze. Your reported exit location will be shown with a red ball. Therefore, if both green and red ball are close or right next to each other, that means your response was correct.

Note that we will record your eye gaze throughout the trial and so do your best not to move your head during the experiment.

Every 50 trials, a gray screen will appear and you are given a break as long as you want. When you are ready to resume the task, press the left-arrow key again. In total, you will complete 400 trials.


\section{Maze Generation}\label{section:appendix_maze_generation}

We procedurally generated mazes via a simple layering procedure.
First, we implemented a path-generation algorithm, which sampled a random path within the $20 \times 20$ maze grid by picking a random starting edgepoint and taking a random walk with turn probability $0.2$ and minimum inter-turn distance $3$ until reaching an edgepoint.
Then, to generate a maze we layered such randomly generated paths with occlusion until every grid location in the maze was covered by some path.
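
For a sense of the resulting path statistics, consider one possible reading of these parameters (an assumption, since the exact turn rule is defined by the code): after each turn the walk takes $3$ steps straight, and thereafter turns with probability $0.2$ before each further step. Ignoring the maze boundary, the number $K$ of additional straight steps is geometric with $\mathbb{E}[K] = (1 - 0.2)/0.2 = 4$, so the expected straight-run length is
$$3 + \mathbb{E}[K] = 7 \text{ grid cells}.$$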

The code for this algorithm can be found in our open-sourced repo:

\href{https://github.com/jazlab/Maze_Task_2022}{https://github.com/jazlab/Maze\_Task\_2022}.

\section{Fovea Size Sweeping}\label{section:tau_sweeping}

\begin{table}[htbp]
\floatconts
 {tab:tau_sweep}
 {\caption{Similarity of gaze RNNs to humans across three values of $\tau$.}}
 {
   \subtable[Nearest neighbors]{
   \begin{tabular}{@{}llll@{}}
    \toprule
    $\tau$ (pixels) & EXIT & SIM & HYBRID \\ \midrule
    8            &   5.04   &  3.10   &   3.71     \\
    5            &   4.65   &  3.11   &   3.85     \\
    3.33         &   4.36   &  3.14   &   3.15     \\ \bottomrule
    \end{tabular}
    }\qquad
   \subtable[Area between paths]{
    \begin{tabular}{@{}llll@{}}
    \toprule
    $\tau$ (pixels) & EXIT & SIM & HYBRID \\ \midrule
    8            &  294.3    &  134.4   &   202.7     \\
    5            &  270.0    &  137.9   &   191.7     \\
    3.33         &  242.8    &  139.2   &   142.6     \\ \bottomrule
    \end{tabular}
   }
 }
\end{table}


\section{Area between paths metric}\label{section:appendix_area_between_paths}

The area-between-paths metric measures similarity of two paths in space.
It is computed as the total plane area of all polygon(s) formed between the two paths.
See Figure \ref{fig:area_between_paths} for an illustration.

\begin{figure}[!ht]
    \begin{center}
        \includegraphics[width=0.4\linewidth]{area_between_paths.jpg}
    \end{center}
    \caption{Illustration of area between paths. Given paths A and B, the area between them is shaded gray.}
    \label{fig:area_between_paths}
\end{figure}
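
As a minimal worked example (the toy paths are hypothetical, and the paths are assumed to share both endpoints and not to cross), let path $A$ run straight from $(0,0)$ to $(2,0)$ and let path $B$ run from $(0,0)$ through $(1,1)$ to $(2,0)$. The two paths enclose a single triangle with vertices $(x_1, y_1) = (0,0)$, $(x_2, y_2) = (2,0)$, and $(x_3, y_3) = (1,1)$, whose area follows from the shoelace formula with indices taken cyclically, $(x_4, y_4) = (x_1, y_1)$:
$$\frac12 \left| \sum_{k=1}^{3} \left( x_k y_{k+1} - x_{k+1} y_k \right) \right| = \frac12 \left| 0 + 2 + 0 \right| = 1.$$
When the two paths cross, they enclose several polygons, and the metric sums the areas of all of them.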


\end{document}
