\section{Introduction}
When we as humans perceive a scene, our eyes constantly move due to their relatively small area of sharp vision, the fovea \cite{holmqvist2011eye}. This restriction also applies when we perceive a virtual environment (\emph{VE}) through a head-mounted display (\emph{HMD}). In both cases, it is possible to distinguish the different eye-movements by taking their inherent properties, such as velocity and acceleration, into account and to classify them into their respective classes \cite{komogortsev2013automated,andersson2017one,startsev20191d,zemblys2019gazenet,agtzidis2016smooth,salvucci2000identifying,dar2021remodnav}.
However, such a classification can only be performed after the samples have been captured. This makes real-time utilization of gaze events or blinks challenging due to the low update rates and high latencies found in commercial head-mounted displays and wearable eye-trackers \cite{stein2021comparison,langbehn2018blink}. This is especially true for saccades, as they are temporally short, fast eye-movements lasting 30--80~ms \cite{holmqvist2011eye}, so wearable eye-trackers often capture only a few samples from which to classify them correctly.

% Was ist das Problem und warum ist es relevant
Nonetheless, knowing when a saccade event occurs would benefit several virtual reality~(\emph{VR}) applications, such as %gaze shifts estimation in
gaze forecasting \cite{hu2020dgaze, hu2021fixationnet}, blink or saccade detection for redirected walking \cite{langbehn2018blink,sun2018towards}, gaze contingent rendering \cite{arabadzhiyska2017saccade} or gaze-based interaction \cite{david2021towards}.
%Here, knowing the time-to-saccade beforehand can be helpful to improve the forementioned algorithms, for example by forcing a gaze shift at the time of saccades when forecasting gaze points or rotating the virtual space in case of redirected walking at the exact time point of the saccade.
%This may result in an event being classified after it already happened, leading previous approaches to rely on long saccades \cite{sun2018towards} or intentional blinking \cite{langbehn2018blink}.
Furthermore, the prediction of fixation durations is also important in areas outside \emph{VR}, one example being scan path prediction, which tries to predict fixation durations along with the sequence of fixation points on a visual stimulus \cite{yang2020predicting}. To use these gaze events, previous applications often mitigate the latency through unnatural actions, such as intentional blinking \cite{langbehn2018blink} or long saccade durations \cite{sun2018towards}.\\

% Was ist time-to-saccade prediction
A different approach was recently proposed by \citet{rolff2022saccade}. They redefined the problem of gaze classification as a recurrent time-to-event prediction of saccade events, predicting the time it takes until a saccade occurs. This approach is fairly general and can also be applied to other gaze events, such as fixations or blinks.
In contrast to classical gaze classification approaches, this redefinition allows estimating, for each input sample of an eye-tracker, the remaining time until the specified event occurs. This is desirable, as it is often less important to know the class at each time step than to know when the class will change. The redefinition also allows accounting for the latency of the eye-trackers found in commercial head-mounted displays and wearable eye-trackers.
To evaluate how well their method performs for time-to-saccade prediction, \citet{rolff2022saccade} utilize the mean absolute error (\emph{mae}) on a set of randomly sampled time-to-event values.
%They especially choose \emph{mae} instead of different time-to-event metrics, as they are concerned with the exact time point the event is going to happen.

In this paper, we define a more fitting sampling strategy than random sampling. This allows us to adapt the previously used error metric to be better suited to the actual problem of time-to-saccade prediction. %For this, we will take the physical limitations of an eye-tracker and the possibility to utilize past information into account.
We will explore how well these metrics can be utilized to understand the prediction and provide a different evaluation method.\\

\noindent To summarize, our work proposes the following contributions:
\begin{itemize}
    \item A different sampling strategy for time-to-saccade data that takes the sequential information of the time-to-event of a gaze event into account. % to bin them into the same training, test, or validation set. %This strategy utilizes information of classical eye-movement classifiers to refine the time-to-saccade prediction.
    \item Novel error metrics that build on this sampling strategy and yield more interpretable results on the predictive performance of a time-to-saccade predictor.
\end{itemize}

\section{Related Work}
% Survival analysis metrics
A commonly utilized metric for time-to-event problems is Harrell's concordance index \mbox{(\emph{c-index})} \cite{harrell1982evaluating}. The \emph{c-index} measures the rank agreement between the predicted risk score and the observed time-to-event, where a higher risk value should correspond to a shorter time-to-event. However, it has also been shown that the \emph{c-index} is biased if the test set contains a high number of censored samples \cite{uno2011statistics}, leading to an alternative definition by \citet{uno2011statistics}. Another commonly used metric is the Brier score \cite{brier1950verification}. Its definition is equivalent to the mean square error (\emph{mse}) for probabilistic predictions of binary events and hence requires a probabilistic prediction from the employed model. It has also been redefined to allow for censored data \cite{graf1999assessment}.

% Time series prediction metrics
Besides survival-analysis-related metrics, there are multiple metrics for time-series forecasting that use the real values of the prediction. Commonly used metrics are the mean absolute, mean square, and (normalized) root-mean-square error \cite{diebold1998elements,hyndman2006another}. Variations of those metrics have been proposed, such as the (symmetric) mean absolute percentage error. Most of the listed error metrics assume that over- and underestimation should be penalized equally. This, however, might not always be the case, thus requiring an asymmetric error function such as the asymmetric mean \cite{diebold1998elements} or the linex error \cite{diebold1998elements,varian1975bayesian}.
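For reference, the linex loss is commonly written, for a prediction error $e$ and parameters $a \neq 0$ and $b > 0$, as
\[
    L(e) = b\left(\exp(a e) - a e - 1\right),
\]
which grows approximately exponentially on one side of $e = 0$ and approximately linearly on the other, so that the sign of $a$ determines whether over- or underestimation is penalized more heavily. The symbols $e$, $a$, and $b$ are introduced here for illustration only and are not used elsewhere in this paper.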

% In addition, the time-series metrics listed so far often just evaluate the predictive accuracy of a model without taking the recall or precision into account. 


\section{Methodology \& Metrics}
\label{sec:Methodology}
\begin{figure}[!tp]
    \centering
    \includegraphics[width=\textwidth]{images/Splitting-Policy.png}
    \caption{Illustration of the sampling strategy, splitting the data at the occurrence of an event into multiple sequences (red). Each sequence then contains multiple samples and is placed into the respective dataset. It also depicts an example for a prediction with overshot (green) and a prediction with undershot (blue).}
    \label{fig:Time-to-saccade prediction} 
\end{figure}
For our experimental setup, we follow, if not noted otherwise,  \citet{rolff2022saccade}, allowing for comparability between both approaches. First, we would like to highlight a disadvantage of their evaluation, the random sampling. This does not take the temporal property of the gaze data into account, as the samples of the same time-to-saccade sequence might have been selected for different datasets. 
%For example, a sample with a high time-to-event value at the start of the sequence could have been selected for the training set, while the sample at the end is in the test set.
As a result, it is impossible to evaluate properties such as the consistency of the prediction % over the sequence
or the error over the whole sequence without the predictor having seen part of the data. This makes it challenging to interpret the reported error metrics, as it is not clear how the predictor behaves over time.

Here, we %change the methodology, by
introduce a new sampling strategy that keeps samples of the same time-to-saccade sequence in the same dataset. As illustrated in Fig.~\ref{fig:Time-to-saccade prediction}, we construct these sequences by splitting the gaze signal exactly where an event occurs, instead of randomly choosing individual samples. % for the problem at hand.%We split the gaze signal exactly if an event happens, instead of having randomly chosen individual samples.
This results in sequences containing multiple individual samples as data points.
%It also allows deriving some properties. 
Furthermore, the first time-to-event value in the sequence is always equal to the duration of the whole sequence.
%, and the last time-to-event value is always one step before the desired event.
Another advantage is that the time-to-event of a sequence is always strictly monotonically decreasing, with the rate depending on the frequency of the eye-tracker. With the update frequency $f_i > 0$ of the eye-tracker at step $i$, each time-to-saccade value ($\text{tts}$) can then be calculated through: $\text{tts}_{i+1} = \text{tts}_i - f_i^{-1}$.
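As a brief illustration with hypothetical values, assume a constant update rate of $f_i = 100$~Hz, so that $f_i^{-1} = 0.01$~s, and a sequence that starts with $\text{tts}_0 = 0.30$~s. The recurrence then yields
\[
    \text{tts}_1 = 0.30\,\text{s} - 0.01\,\text{s} = 0.29\,\text{s}, \qquad \text{tts}_2 = 0.29\,\text{s} - 0.01\,\text{s} = 0.28\,\text{s},
\]
and so on, so that every value in the sequence is determined by its first value and the number of elapsed steps.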
%As an example, if the update rate of the eye-tracker is 100~Hz, then each time-to-saccade value at step $i$ can be calculated through: $\text{time-to-saccade}_{i+1} = \text{tte}_i - 0.01\text{s}$.

These observations allow us to explore additional error metrics that account for the temporal properties of a time-to-saccade sequence and take information from eye-movement classifiers into account. %While the earlier listed metrics for time-to-event prediction are helpful in their indented space, 
Time-to-saccade sequences are rarely right-censored, as saccades are repeatable events that happen every 300 to 2500~ms.
%This is due to the capturing setup and the observation that gaze events are repeatable events that happen every 300 to 2500~ms.
As a result, the only right-censored sequences are those at the end of an eye-tracking session, which often correspond to a small portion of the dataset. Therefore, like \citet{rolff2022saccade}, we advise against using the earlier listed time-to-event metrics for the evaluation of predictions, even under the new sampling strategy.
%Thus, depending on the length of data capture, often only corresponding to a small proportion of the captured dataset, requiring only the non-censored time-to-event metrics.
Moreover, the task of time-to-saccade prediction requires predicting the time-to-event as accurately as possible, and metrics such as the \emph{c-index} do not provide helpful information on this accuracy.
Here, it is better to utilize metrics for time-series forecasting, as they are concerned with the difference between the actual and the predicted time-to-event. However, as these are fairly general and do not allow insight into a time-to-saccade predictor model, we propose the following additional metrics:

\paragraph{Consistency:} To measure how consistent the model is in its prediction, we define the consistency of a sequence $j$ with length $l$ as the relative difference \cite{diebold1998elements} between the change from the current to the next prediction and the expected change. Ideally, this change should be equal to the sampling interval of the eye-tracker, i.e., the inverse $f_{i+1}^{-1}$ of its update frequency, due to the definition of time-to-saccade. Hence, we can define \emph{consistency} as:
\[
    c_j = \sum_{i=0}^{l-1} \frac{\left|\left|p_{i+1}-p_i\right| - f_{i+1}^{-1}\right|}{|f_{i+1}^{-1}|}.
\]
As this gets evaluated over each sequence, we can derive the mean %, max, and min
consistency of a dataset through the arithmetic mean.
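As a worked illustration with hypothetical values (not taken from our data), assume a constant update rate of 100~Hz, i.e., $f_{i+1}^{-1} = 0.01$~s, and four consecutive predictions $p = (0.30, 0.29, 0.27, 0.27)$~s. Assuming that the sum runs over the three consecutive prediction pairs, we obtain
\[
    c_j = \frac{\left|0.01 - 0.01\right|}{0.01} + \frac{\left|0.02 - 0.01\right|}{0.01} + \frac{\left|0.00 - 0.01\right|}{0.01} = 0 + 1 + 1 = 2.
\]
Each step in which the prediction decreases by exactly one sampling interval contributes $0$, whereas each step in which the prediction does not change contributes $1$.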

\paragraph{Average overshot and undershot rate:} Overshot and undershot measure if the model generally tends to predict durations that are too short (\emph{undershot}) or too long (\emph{overshot}). For the non-temporal time-to-saccade problem of a sample $j$ with time-to-saccade duration $d_j$ and predicted duration $p_j$, %of the time-to-saccade problem,
an overshot happens if $d_j - p_j < 0$, and undershot if $d_j - p_j > 0$. % can be defined as,
%\begin{align*}
%    \text{undershot}_j &= d_j - p_j \hspace{0.5cm} \text{if } d_j - p_j > 0,\\
%    \text{overshot}_j  &= p_j - d_j \hspace{0.5cm} \text{if } d_j - p_j < 0,
%\end{align*}%is equal to \mbox{$\max(p'-d', 0)$} and the undershot to \mbox{$\max(d' - p', 0)$}, 
%comparing the actual time-to-saccade $d_j$ with the predicted one $p_j$.
%However, in our case this is not possible, since the model can frequently change its prediction due to its recurrent nature. Therefore, 
%In case of a random predictor, it might happen, that it fairly regularly predicts an event in the next step, hence, resulting in multiple undershoots. In general,
%We would argue that an undershot is far more problematic for downstream methods than an overshot, as an overshot can typically be corrected for by utilizing the newly sampled data from the eye-tracker.
This calculation is not possible with a recurrent time-to-saccade prediction, as the predictor outputs an estimation $p_{j_i}$ for each step $i$.
Therefore, we calculate the average time-to-saccade~$p_j$ using the arithmetic mean
%\mbox{$p_j = \frac{1}{l_j}\sum_i^{l_j} p_{j_i}$}
for the estimation of over- and undershot of the sequence~$j$. %with length $l_j$.
%Here, we use the observation that the first sample of the sequence is equal to the duration $d$ of the time-to-saccade.
%While this is not optimal as the undershot might be at the start of a time-to-saccade prediction and therefore not be at the final prediction shortly before the event, we assume this to be a reasonable approximation for a general overview. 
While this is not optimal as the prediction might over- or undershoot with time, we assume this to be a reasonable approximation for a general overview.
This allows us to define the average overshot and undershot rate for a set of $n$ sequences as:
\begin{align*}
    \text{avg. overshot rate} = \frac{1}{n}{\textstyle\sum}_{j=1}^n \mathbbm{1}_{d_j < p_j},\ \  %\sum_{j=1}^n \mathbbm{1}_{d_j - p_j < 0}, \ \ 
    \text{avg. undershot rate} = \frac{1}{n}{\textstyle\sum}_{j=1}^n \mathbbm{1}_{d_j > p_j} % \frac{1}{n}\sum_{j=1}^n \mathbbm{1}_{d_j - p_j > 0}
\end{align*}
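As a brief illustration with hypothetical values, consider $n = 5$ sequences with durations $d = (0.40, 0.30, 0.60, 0.20, 0.50)$~s and average predictions $p = (0.50, 0.25, 0.70, 0.15, 0.65)$~s. The average prediction exceeds the duration for sequences $1$, $3$, and $5$ and falls below it for sequences $2$ and $4$, so that
\[
    \text{avg. overshot rate} = \tfrac{3}{5} = 0.6, \qquad \text{avg. undershot rate} = \tfrac{2}{5} = 0.4.
\]
Both rates sum to one unless $d_j = p_j$ holds exactly for some sequence.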

%Further, we assume the real time-to-event to be $t$. With those, we define the overshot and undershoot offset $o$ of the sequence as: $o = t - p$.
%Using this offset, we can define a prediction to be an overshot if $o > 0$ and an undershot if $o < 0$. Using those, we can define the arithmetic mean over all sequences as the average overshot and undershot rate.
%These also allow us to define the average overshoot and undershoot rate in the sequence: 
%\[
%    \text{avg. overshot rate} = \frac{1}{l}\sum_i^l \mathbbm{1}_{o_i > 0} \text{ and } \text{avg. undershot rate}  = \frac{1}{l}\sum_i^l \mathbbm{1}_{o_i < 0},
%\]
%with $\mathbbm{1}_{o_i > 0} = 1\text{ if } o_i > 0 \text{ and } \mathbbm{1}_{o_i > 0}=0 \text{ if } o_i \leq 0$.

\paragraph{Average sequence and undershot error:} In case of an overshot, we can use the gaze information already provided by the eye-tracker to detect that the event has occurred and would therefore not perform an action at the wrongly predicted time. In case of an undershot, however, no additional information is available that would reveal the action as wrongfully performed. %as we can detect it using new data samples from the eye-tracker,
%we can define the undershot error as the part of the prediction where it undershoots the signal. In this case, we would not notice this undershoot when we use the predictor in an application, as we cannot exploit any additional information.
Hence, we calculate the undershot error only in cases where the prediction of a model undershoots the actual duration, and assume a perfect prediction otherwise, by defining the average undershot error as:
\begin{align*}
    \text{average undershot error}_f &= \frac{1}{n} {\textstyle\sum}_{j=1}^n f(d_j, p_j)\cdot \mathbbm{1}_{p_j < d_j},
\intertext{
using the indicator function~$\mathbbm{1}$ and the error metrics $f\in \{\text{mse}, \text{mae}\}$.
To calculate the average time-to-saccade error (avg. tts.), we %omit this assumption and estimate 
use previous definitions of time-to-saccade duration $d_j$ and average time-to-saccade prediction $p_j$ of a sequence $j$:%, following previous work \cite{yang2020predicting}:
%compute the error based on the average time-to-saccade $d$ of the sequence $j$, same as previous work for scan path prediction \cite{yang2020predicting}:
}
    \text{average sequence error}_f &= \frac{1}{n} {\textstyle\sum}_{j=1}^n f(d_j, p_j)
\end{align*}
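As a worked illustration with hypothetical values, consider $n = 4$ sequences with durations $d = (0.40, 0.30, 0.50, 0.60)$~s and average predictions $p = (0.50, 0.25, 0.35, 0.70)$~s, and assume that $f = \text{mae}$ reduces to $|d_j - p_j|$ for a single sequence. Only sequences $2$ and $3$ are undershot, which gives
\begin{align*}
    \text{average undershot error}_{\text{mae}} &= \tfrac{1}{4}\left(0 + 0.05 + 0.15 + 0\right)\,\text{s} = 0.05\,\text{s},\\
    \text{average sequence error}_{\text{mae}} &= \tfrac{1}{4}\left(0.10 + 0.05 + 0.15 + 0.10\right)\,\text{s} = 0.10\,\text{s}.
\end{align*}
The undershot error is therefore never larger than the sequence error for the same $f$, as overshot sequences contribute zero.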
%To stay consistent with previous literature, we utilize the mean absolute and mean square error for undershot error estimation.

\paragraph{Sectioning:} The prediction of a model may change with time. Depending on the utilized method, it might not have enough information at the beginning to predict an accurate time-to-saccade. %This could be for example be at the beginning during the execution of the last saccade.
As a result, the prediction may improve over time without this being evident from the earlier mentioned metrics. Hence, we split each time-to-saccade sequence $S_j[1,\dots,l_j]$ of length $l_j$ into $k$ sections $s_{j_m} = S_j[\ceil{\frac{l_j}{k}\cdot (m - 1)},\dots, \floor{\frac{l_j}{k}\cdot m}]$ with $m \in \{1,\dots,k\}$. Then we calculate the error over all sections $S_m = \{s_{1_m}, \dots, s_{n_m}\}$ of the same bin $m$, showing the behavior of the error over time.
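As a small illustration with hypothetical values, and reading the index in the section bounds as the section number $m \in \{1,\dots,k\}$, a sequence of length $l_j = 12$ split into $k = 5$ sections has $\frac{l_j}{k} = 2.4$, so that, for instance,
\begin{align*}
    s_{j_2} &= S_j[\ceil{2.4},\dots,\floor{4.8}] = S_j[3,\dots,4],\\
    s_{j_3} &= S_j[\ceil{4.8},\dots,\floor{7.2}] = S_j[5,\dots,7],\\
    s_{j_5} &= S_j[\ceil{9.6},\dots,\floor{12}] = S_j[10,\dots,12].
\end{align*}
The last section thus contains the samples closest to the event, and the error reported for a bin pools the corresponding sections of all sequences.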
%Moreover, we also define $se_{\text{sacc.}}, se_{\text{fix}}$ as the error of the part of the time-to-saccade sequence which is still classified as a saccade or fixation.
%To estimate the predictive performance before the actual occurring event, we define $se_{\text{fin}}$ as the error of the last 10\% of the sequence.

\section{Evaluation \& Discussion}
To evaluate our approach defined in Sec.~\ref{sec:Methodology} we utilize a linear regressor with a Nyström approximation \cite{williams2000using} trained through stochastic gradient descent \cite{robbins1951stochastic}. This has been chosen, as it was identified as the best performing regressor among four other classical methods \cite{rolff2022saccade}. The models were trained as specified in \cite{rolff2022saccade}. %Hence, we first perform a search over the optimal window length for each feature for the window sizes of 1, 10, 20, 40, 60, 80, 100 using 7-fold cross validation. Afterwards, we perform  a feature selection process using recursive feature elimination (RFE) \cite{guyon2002gene} with cross validation. Same as \citet{rolff2022saccade} we discard the whole feature window instead of single features. We perform Bayesian optimization during the RFE and the window length estimation, to find a good set of hyperparameters. But due to computational constraints, we perform the hyperparameter selection once at the beginning of the RFE. Using the selected features of the RFE process, we then fit a final regressor that is used for evaluation.
One notable exception during training is the sampling strategy used. Here, we made sure that samples leading to the same gaze event are placed in the same training, test, or validation dataset. To train the models, we utilize the DGaze \cite{hu2020dgaze}, FixationNet \cite{hu2021fixationnet} and EGTEA Gaze+ \cite{li2018eye} datasets. In addition, we use some artificial prediction strategies to evaluate the proposed metrics on synthetically generated predictions. For those, we employ: mean time-to-saccade (mean), zero prediction (zero), maximum time-to-saccade (max), and random time-to-saccade prediction (rand). A more extensive evaluation of those can be found in the appendix.
\begin{table}[!t]
    \centering
    \caption{Results of the SGD regressor and average time-to-event using the metrics described in Sec.~\ref{sec:Methodology} along with the mean square error (mse) and mean absolute error (mae). A lower error is preferred.}
    \resizebox{\textwidth}{!}{%
    \begin{tabular}{l|c|c|c|c|c|c}
        \toprule
        Metric & \multicolumn{3}{c}{SGD} & \multicolumn{3}{c}{avg. time-to-event}\\
        & DGaze & FixationNet & EGTEA & DGaze & FixationNet & EGTEA\\ 
        \midrule
        mse$\downarrow$
                       & 0.1285 \si{\second}$^2$ & 0.2390 \si{\second}$^2$ & 0.1672 \si{\second}$^2$
                       & 0.2314 \si{\second}$^2$ & 0.3647 \si{\second}$^2$ & 0.3043 \si{\second}$^2$\\
        mae$\downarrow$
                       & 0.2556 \si{\second}     & 0.3567 \si{\second}     & 0.2668 \si{\second}
                       & 0.3387 \si{\second}     & 0.4261 \si{\second}     & 0.3677 \si{\second}\\
        avg. tts mse$\downarrow$
                       & 0.0494 \si{\second}$^2$ & 0.0887 \si{\second}$^2$ & 0.0420 \si{\second}$^2$
                       & 0.0672 \si{\second}$^2$ & 0.1035 \si{\second}$^2$ & 0.0745 \si{\second}$^2$\\
        avg. tts mae$\downarrow$
                       & 0.1747 \si{\second}     & 0.2422 \si{\second}     & 0.1484 \si{\second}
                       & 0.1792 \si{\second}     & 0.2260 \si{\second}     & 0.1765 \si{\second}\\
        undershot mse$\downarrow$
                       & 0.0319 \si{\second}$^2$ & 0.0537 \si{\second}$^2$ & 0.0311 \si{\second}$^2$
                       & 0.0664 \si{\second}$^2$ & 0.1003 \si{\second}$^2$ & 0.0718 \si{\second}$^2$\\
        undershot mae$\downarrow$
                       & 0.0857 \si{\second}     & 0.1105 \si{\second}     & 0.0799 \si{\second}
                       & 0.1666 \si{\second}     & 0.1968 \si{\second}     & 0.1470 \si{\second}\\
        o/u shot rate$\downarrow$
                       & 0.61/0.39 & 0.64/0.36 & 0.60/0.40
                       & 0.27/0.73 & 0.36/0.64 & 0.43/0.57\\
        consistency$\downarrow$
                    & 1.64    & 1.39 & 1.12
                    & 1.0  & 1.0  & 1.0\\
        \bottomrule
    \end{tabular}
    }
    \label{tab:sgd}
\end{table}
\begin{figure}
    \centering
    \includegraphics[width=0.49\textwidth]{images/mae-dgaze.png}%
    \includegraphics[width=0.49\textwidth]{images/mae-fixationnet.png}
    \caption{Error of different predictors on different sections as explained in Sec.~\ref{sec:Methodology}. We divide all time-to-saccade sequences into 10 different sections of equal length to estimate the mean square error on the DGaze \cite{hu2020dgaze} (left) and FixationNet \cite{hu2021fixationnet} (right) datasets.}
    \label{fig:sections}
\end{figure}

Table~\ref{tab:sgd} shows the measured results of the predictions on the DGaze~\cite{hu2020dgaze}, FixationNet~\cite{hu2021fixationnet} and EGTEA Gaze+~\cite{li2018eye} datasets. While close to previous literature \cite{rolff2022saccade}, the results are still slightly different due to the different sampling method. However, it is also evident that this results in a higher overshot rate, as the predictor cannot estimate the correct time-to-saccade for most data samples. Moreover, the consistency of the SGD predictor is worse than that of the average predictor. This is expected, as the average predictor reports very consistent results by predicting the mean value for every sample.
It can also be seen that the undershot error is much lower for the SGD predictor than for the average time-to-event. This is consistent with the undershot rate, as the SGD predictor undershoots less often than the average predictor, making it more useful for real-world applications. Here, we assume undershots to be more of a problem than overshots, because they can trigger downstream methods while the user is still able to notice them. In contrast, an overshot can be mitigated through the utilization of data samples from the eye-tracker. Fig.~\ref{fig:sections} shows the evaluation of the 5 different baseline predictor models along with the SGD predictor over multiple sections of all sequences. Here, it can be seen that the SGD predictor outperforms all baselines most of the time, except for a brief range of 20--30\% of the sequence length before the actual event, where it is outperformed by the mean absolute error.
This indicates that the SGD predictor tends to do better than the other predictors, but eventually fails shortly before the actual event.
We also performed additional evaluations, which can be found in the appendix due to space restrictions.

\section{Conclusion}
In this paper, we proposed a new sampling strategy that lets us take the sequential information of gaze data into account for time-to-saccade prediction. This enabled us to define multiple new metrics capturing the consistency and duration of time-to-saccade predictors, as well as their overall behavior over different parts of time-to-saccade sequences. To evaluate these, we used the state-of-the-art time-to-saccade predictor and compared it against a simple average baseline. However, we also expect future work on this topic, especially on overshot and undershot evaluation, as the current metrics only evaluate the average over- and undershot over the whole sequence and do not take the prediction strategy of a proposed model into account.