\section{Methods}

\subsection{Simulation Environment}

We designed a long, narrow gridworld environment to study sequential prediction in navigation tasks with repeating visual patterns. The environment is a $48 \times 5$ grid whose $6$ distinct color bands repeat along the horizontal axis, creating a corridor-like structure. Each color band has a width of $w = \lfloor L_x / (n_{\text{colors}} \times 2) \rfloor = 4$ cells, where $L_x = 48$ is the grid width and $n_{\text{colors}} = 6$ is the number of distinct colors, giving $L_x / w = 12$ bands in total.
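The band-width formula can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the mapping from column to color assumes that the bands cycle through the colors in a fixed order, which the text does not state explicitly.

```python
def band_width(grid_width: int, n_colors: int) -> int:
    """Width of one color band: floor(L_x / (n_colors * 2))."""
    return grid_width // (n_colors * 2)


def color_index(x: int, grid_width: int = 48, n_colors: int = 6) -> int:
    """Color of column x, ASSUMING the bands cycle through the colors in order."""
    return (x // band_width(grid_width, n_colors)) % n_colors


print(band_width(48, 6))                                # 4
print([color_index(x) for x in (0, 3, 4, 23, 24, 47)])  # [0, 0, 1, 5, 0, 5]
```

Under this assumption, columns $0$ and $24$ share a color, so the pattern seen in the first half of the corridor reappears in the second half.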

The agent receives egocentric observations through a $5 \times 5$ window centered on its current position. The observation space includes:
\begin{itemize}
    \item One-hot encoded color channels for each of the $6$ colors
    \item A wall channel indicating out-of-bounds areas
    \item An object channel for randomly placed objects outside the grid boundaries
\end{itemize}

The total observation dimension is $d_{\text{obs}} = 5 \times 5 \times (6 + 2) = 200$. The $10$ objects are placed at random positions within a margin of $2$ cells outside the grid boundaries, which matches the reach of the $5 \times 5$ window, so that they provide additional visual context near the walls.
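As an illustration of how the $200$-dimensional observation could be assembled, the sketch below builds the flattened $5 \times 5 \times 8$ window in plain Python. Several details are assumptions not fixed by the text: the window is not rotated with the agent's heading, the channel order is colors, then wall, then object, out-of-bounds object cells also set the wall channel, and the name \texttt{observe} is hypothetical.

```python
N_COLORS = 6
GRID_W, GRID_H = 48, 5
WINDOW = 5
BAND_W = GRID_W // (N_COLORS * 2)  # 4 cells per band
N_CHANNELS = N_COLORS + 2          # colors + wall + object


def observe(x, y, objects=frozenset()):
    """Return the flattened egocentric observation as a list of 200 floats."""
    half = WINDOW // 2
    obs = []
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            cx, cy = x + dx, y + dy
            cell = [0.0] * N_CHANNELS
            if 0 <= cx < GRID_W and 0 <= cy < GRID_H:
                # In-bounds cell: one-hot color (ASSUMES cyclic band order).
                cell[(cx // BAND_W) % N_COLORS] = 1.0
            else:
                cell[N_COLORS] = 1.0          # wall channel
                if (cx, cy) in objects:
                    cell[N_COLORS + 1] = 1.0  # object channel
            obs.extend(cell)
    return obs


print(len(observe(10, 2)))  # 200
```

In the middle of the corridor every cell in the window is in bounds, so exactly $25$ entries are active; near a corner, the wall channel takes over for the out-of-bounds cells.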

\subsection{Agent Behavior}

We implemented a reactive agent that performs a random walk with wall-avoidance behavior. The agent has four possible actions: forward, left turn, right turn, and backward. The agent's behavior is characterized by:

\begin{itemize}
    \item \textbf{Wall Detection}: The agent detects walls by attempting a forward movement and checking whether its position changes
    \item \textbf{Wall Avoidance}: When a wall is detected, the agent turns with $90\%$ probability (left or right with equal probability)
    \item \textbf{Forward Movement}: When no wall is detected, the agent moves forward by default (with $90\%$ probability)
    \item \textbf{Exploration}: With the remaining $10\%$ probability, the agent performs a random turn even when no wall is present
\end{itemize}
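The behavior described above can be sketched as a small stochastic policy. The sketch reads forward movement as the default and exploration as a $10\%$ deviation from it. What the agent does in the remaining $10\%$ of wall encounters is not specified in the text; the backward step used here is purely an assumption, as is the function name \texttt{choose\_action}.

```python
import random

ACTIONS = ("forward", "left", "right", "backward")


def choose_action(wall_ahead: bool, rng: random.Random) -> str:
    """Sample one action of the reactive random-walk agent."""
    if wall_ahead:
        if rng.random() < 0.9:
            # Wall avoidance: turn left or right with equal probability.
            return rng.choice(("left", "right"))
        return "backward"  # ASSUMPTION: not specified in the text
    if rng.random() < 0.1:
        # Exploration: occasional random turn in open space.
        return rng.choice(("left", "right"))
    return "forward"


rng = random.Random(0)
print([choose_action(False, rng) for _ in range(5)])
```

Passing an explicit \texttt{random.Random} instance keeps trajectories reproducible across runs.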

The agent's heading is represented by both a one-hot encoding of the four directions and the sine and cosine of the heading angle, providing $6$-dimensional heading information (4 one-hot + 2 sin/cos).
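A minimal sketch of this encoding is given below. It assumes that headings are indexed $0$ to $3$ in $90^{\circ}$ steps and that the sine precedes the cosine in the feature vector; neither convention is fixed by the text, and \texttt{heading\_features} is a hypothetical name.

```python
import math


def heading_features(heading: int) -> list[float]:
    """Return [one-hot (4), sin, cos] for a heading index in {0, 1, 2, 3}."""
    one_hot = [1.0 if i == heading else 0.0 for i in range(4)]
    angle = heading * math.pi / 2  # ASSUMPTION: 90-degree steps from index 0
    return one_hot + [math.sin(angle), math.cos(angle)]


print(heading_features(0))  # [1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

The one-hot part gives the network a discrete code, while the sine and cosine place the four headings on the unit circle so that neighboring directions are close to each other.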

\subsection{Sequential Prediction Task}

We formulate the task as $k$-step sequential prediction, where the model must predict future observations given:
\begin{itemize}
    \item An initial observation $o_0$
    \item A sequence of $k$ future actions and heading features $f_{1:k} = \{(a_t, h_t)\}_{t=1}^{k}$
\end{itemize}

The model is trained to predict the corresponding sequence of future observations $o_{1:k}$ using mean squared error loss:
\begin{equation}
\mathcal{L} = \frac{1}{k} \sum_{t=1}^k \|o_t - \hat{o}_t\|_2^2
\end{equation}

where $\hat{o}_t$ is the model's prediction for the observation at step $t$.
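To make the loss concrete, the following sketch evaluates it on plain Python lists. The function name \texttt{sequence\_loss} is hypothetical; the computation follows the equation term by term.

```python
def sequence_loss(targets, predictions):
    """Average over k steps of the squared L2 distance between o_t and o_hat_t."""
    k = len(targets)
    total = 0.0
    for o_t, o_hat_t in zip(targets, predictions):
        # Squared L2 norm: sum of squared component-wise errors.
        total += sum((a - b) ** 2 for a, b in zip(o_t, o_hat_t))
    return total / k


# Two steps, one wrong component in the first step: (1 + 0) / 2
print(sequence_loss([[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]))  # 0.5
```

Note that, as written, the loss sums over the components of each observation and averages only over the $k$ steps. A per-element mean squared error, such as the default reduction of PyTorch's \texttt{nn.MSELoss}, would differ from it by a constant factor of $d_{\text{obs}}$.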

\subsection{Model Architecture}

We employ a GRU-based recurrent neural network with the following architecture:

\begin{itemize}
    \item \textbf{Observation Encoder}: Two-layer MLP with hidden dimension $d_h = 128$, a ReLU activation between the layers, and a Tanh activation on the output
    \item \textbf{Feature Encoder}: Two-layer MLP with the same architecture as the observation encoder, applied to the action-heading features
    \item \textbf{GRU}: Two-layer GRU with hidden dimension $d_h = 128$
    \item \textbf{Prediction Head}: Two-layer MLP with ReLU activation for generating observation predictions
\end{itemize}

The model processes the initial observation $o_0$ through the observation encoder to initialize the GRU hidden state. The sequence of action-heading features $f_{1:k}$ is then processed through the feature encoder and fed to the GRU to generate predictions for each of the $k$ future time steps.
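The data flow described above can be sketched in PyTorch as follows. This is not the original implementation. The class name \texttt{SequencePredictor} is hypothetical, the feature dimension of $10$ assumes a $4$-dimensional one-hot action concatenated with the $6$-dimensional heading features, and the encoded observation is copied into both GRU layers, which is one of several plausible ways to initialize a two-layer hidden state.

```python
import torch
from torch import nn


class SequencePredictor(nn.Module):
    """Hypothetical sketch of the GRU-based k-step observation predictor."""

    def __init__(self, obs_dim=200, feat_dim=10, hidden_dim=128, num_layers=2):
        super().__init__()
        self.num_layers = num_layers
        self.obs_encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.feat_encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.gru = nn.GRU(hidden_dim, hidden_dim,
                          num_layers=num_layers, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, obs_dim),
        )

    def forward(self, o0, feats):
        # o0: (batch, obs_dim); feats: (batch, k, feat_dim)
        h0 = self.obs_encoder(o0)                            # (batch, hidden)
        # ASSUMPTION: the same encoded observation initializes every GRU layer.
        h0 = h0.unsqueeze(0).repeat(self.num_layers, 1, 1)   # (layers, batch, hidden)
        states, _ = self.gru(self.feat_encoder(feats), h0)   # (batch, k, hidden)
        return self.head(states), states                     # predictions, hidden states


model = SequencePredictor()
preds, states = model(torch.zeros(3, 200), torch.zeros(3, 7, 10))
print(preds.shape, states.shape)
```

Returning the per-step hidden states alongside the predictions makes them directly available for later analysis of the learned representation.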

\subsection{Training Configuration}

The model is trained using the following hyperparameters:
\begin{itemize}
    \item Training trajectories: $200$ trajectories of length $T = 40$
    \item Validation trajectories: $30$ trajectories
    \item Batch size: $128$
    \item Learning rate: $2 \times 10^{-3}$ (Adam optimizer)
    \item Gradient clipping: $1.0$
    \item Training epochs: $10$
\end{itemize}
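Each trajectory has to be cut into training examples of the form $(o_0, f_{1:k}, o_{1:k})$. The sketch below shows one way to do this with stride-$1$ sliding windows. The windowing scheme, the name \texttt{make\_windows}, the alignment convention (the feature at index $t$ describes the transition into the observation at index $t$), and the value $k = 5$ in the usage line are all assumptions made for illustration.

```python
def make_windows(observations, features, k):
    """Cut one trajectory into (o_0, f_{1:k}, o_{1:k}) training examples.

    observations[t] is the observation at step t; features[t] is the
    action-heading feature for the transition into step t (features[0] is
    unused). ASSUMPTION: stride-1 sliding windows over the trajectory.
    """
    windows = []
    for start in range(len(observations) - k):
        windows.append((
            observations[start],
            features[start + 1:start + k + 1],
            observations[start + 1:start + k + 1],
        ))
    return windows


# Toy trajectory of length T = 40 with integers standing in for real data.
toy = list(range(40))
print(len(make_windows(toy, toy, k=5)))  # 35
```

Under these assumptions, a trajectory of length $T$ yields $T - k$ examples.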

After training, we analyze the learned representations by applying Principal Component Analysis (PCA) to the GRU hidden states; the procedure is described in the evaluation protocol.

\subsection{Evaluation Protocol}

For evaluation, we generate $100$ trajectories of length $T_{\text{eval}} = 50$ and extract hidden states from the trained model. We apply PCA to reduce the $128$-dimensional hidden states to $2$D for visualization, coloring points by the agent's $x$-position to reveal spatial structure in the learned representations.

The evaluation protocol allows us to assess whether the model has learned to encode spatial information in its hidden states, which would be evidenced by clustering or smooth transitions in the PCA visualization corresponding to the agent's position along the corridor.
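The projection step can be sketched with NumPy alone, using the singular value decomposition of the centered hidden states. The function name \texttt{pca\_2d} is hypothetical, and the random matrix below is only a placeholder with the right width ($128$) standing in for hidden states extracted from the trained model.

```python
import numpy as np


def pca_2d(hidden_states):
    """Project (n, d) hidden states onto their first two principal components."""
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projection = centered @ vt[:2].T
    explained = singular_values[:2] ** 2 / np.sum(singular_values ** 2)
    return projection, explained


rng = np.random.default_rng(0)
states = rng.normal(size=(5000, 128))  # placeholder, NOT real GRU states
proj, explained = pca_2d(states)
print(proj.shape)  # (5000, 2)
```

The two columns of the projection can then be passed to a scatter plot with the agent's $x$-position as the color value. The explained-variance ratios indicate how much of the hidden-state variance the $2$D view actually captures.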
