\section{Definition of Equivariance of Gaze Direction w.r.t. Viewpoints}
This section elaborates on the equivariance relationship between gaze directions in multi-view geometry, which is the key idea behind the \gazeclr{} training framework. Consider two samples captured at the same timestamp of a video from two different camera viewpoints, and let $g_{v_1}$ and $g_{v_2}$ denote their gaze directions, each expressed in its own camera reference frame. These two gaze directions are related through the relative camera pose $R_{C_1}^{C_2}$ as follows:
\begin{equation}
\begin{split}
    g_{v_2}  &= R_{C_1}^{C_2} g_{v_1}\\
    g_{v_2}  &= R_{S}^{C_2} R_{C_1}^{S} g_{v_1} \\
    (R_{S}^{C_2})^{-1} g_{v_2} &= (R_{S}^{C_2})^{-1} R_{S}^{C_2} R_{C_1}^{S}  g_{v_1}\\
    R_{C_2}^{S} g_{v_2} &= R_{C_1}^{S}  g_{v_1}\\
    \bar{g}_{v_1} &= \bar{g}_{v_2} 
\end{split}
\end{equation}
\noindent
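where $R_{S}^{C_i}$ is the relative pose (rotation) from the screen reference frame $S$ to the reference frame of camera view $i$, $R_{C_i}^{S} = (R_{S}^{C_i})^{-1}$ is its inverse, and $\bar{g}_{v_i} = R_{C_i}^{S} g_{v_i}$ is the gaze direction of view $i$ expressed in the common screen frame. We impose the same relationship, i.e., $R_{C_2}^{S} g_{v_2} = R_{C_1}^{S}  g_{v_1}$, on the embeddings obtained from the multi-view learning branch, so that minimizing $\mathscr{L}^{E}$ (Equation~\ref{loss:equivinfonce}) encourages $\bar{z}_{v_1} = \bar{z}_{v_2}$. This relation is depicted by the rotation symbol in Figure~\ref{fig:arch}.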
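
\noindent The step from the third to the fourth line of the derivation above relies only on standard properties of rotation matrices, spelled out here as a short worked note (the identity matrix $I$ and the special case below are illustrative assumptions, not part of the method):
\begin{equation*}
\begin{split}
    (R_{S}^{C_2})^{-1} &= (R_{S}^{C_2})^{\top} = R_{C_2}^{S}, \\
    (R_{S}^{C_2})^{-1} R_{S}^{C_2} &= I, \\
    R_{C_1}^{C_2} &= R_{S}^{C_2} R_{C_1}^{S}.
\end{split}
\end{equation*}
For instance, if camera $1$ happened to be aligned with the screen frame, i.e., $R_{C_1}^{S} = I$, then $g_{v_2} = R_{S}^{C_2} g_{v_1}$ and hence $R_{C_2}^{S} g_{v_2} = R_{C_2}^{S} R_{S}^{C_2} g_{v_1} = g_{v_1} = R_{C_1}^{S} g_{v_1}$, so both views map to the same direction in the screen frame.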


% \section{Experimental Setup -- Additional Details}
% \label{appendix:implementation}

% \paragraph{Architectural details.} 
% All experiments use ResNet-18~\citep{he2016deep} as the encoder network and take the output from the average pooling layer. The encoder is trained from scratch. Following~\citet{chen2020simple}, both projection heads $p_1(\cdot)$ and $p_2(\cdot)$ are two-layer MLP networks with ReLU non-linearity. The output dimensions for the first and second layers are $512$ and $180$, respectively. The input image size is $128\times 128$. 

% \paragraph{Training details.} \gazeclr{} is trained using SGD optimizer with initial learning rate $=0.03$, momentum $=0.9$, and cosine annealing~\citep{loshchilov2016sgdr} for the learning rate decay. We use a single 1080 GeForce GTX GPU for training, with a batch size of 128, and train for 50K iterations. Our mini-batch is made up of samples from a single participant. The temperature coefficient $\tau$ is set to $0.1$. For the augmentation transformations $\mathcal{A}$, we apply random spatial cropping and resizing, gaussian blur, color perturbation ($p=0.8$) on  brightness, contrast, saturation and hue,  grayscale conversion ($p=0.2$) and auto-contrast ($p=0.5$).


% \paragraph{Data pre-processing.} We use face images available in the EVE
% % \footnote{This dataset is licensed under a \href{https://creativecommons.org/licenses/by-nc-sa/4.0/}{CC BY-NC-SA 4.0.}} 
% dataset, obtained after applying a data-normalization procedure~\citep{sugano2014learning, zhang2018revisiting}. The normalization pipeline transforms the gaze annotation to a normalized camera space through a rotation matrix $M$. Note that we post-multiply $R_C^S$ with $M^{-1}$ as $R_C^S$ is defined w.r.t. the original camera reference frame, i.e., $\bar{z}_{v} = R_{C_v}^S (M)^{-1} \hat{z}_{v}$.



\section{Additional Results}
\label{sec:additional-res}

\subsection{Further Transfer Learning Evaluation}

To further evaluate the transferability of the representations learned by the \gazeclr{} framework, we use the \textit{Finetuning (FT)} protocol. Here, we fine-tune the entire network (including the encoder) end-to-end on the target dataset, use a few calibration samples from the test subject, and evaluate on the remaining samples.


Table~\ref{table:cross-data-eval} presents the FT results on MPIIGaze and Columbia. For this experiment, we adopt the architecture of \citet{chen2020offset}, in which a subject-dependent bias term is learned jointly with the end-to-end network. We use 4-fold and leave-one-out (15-fold) evaluation protocols for Columbia and MPIIGaze, respectively. 

Unlike \citet{chen2020offset}, our input is a full-face image, and the backbone is a pre-trained encoder. During inference, we take a few calibration samples for each subject and estimate the subject-dependent bias term. We then evaluate on the remaining samples and repeat this calibration over 10 runs for each subject. Table~\ref{table:cross-data-eval} reports the mean and standard deviation of the angular errors over the 10 runs for various few-shot settings. \gazeclr(\textit{Equiv}) attains the lowest mean error among all pre-training baselines, and also improves over \citet{chen2020offset} (w/o Pre-training), in every few-shot setting on both datasets. Its margin over the strongest baseline, SimCLR~\citep{chen2020simple}, is larger on MPIIGaze ($0.14$--$0.23^{\circ}$) than on Columbia ($0.01$--$0.07^{\circ}$), which indicates the improved generalization capability of our learned representations, particularly on the MPIIGaze dataset. \gazeclr(\textit{Inv}+\textit{Equiv}) is either superior to or competitive with the other baselines: it matches or improves on SimCLR on MPIIGaze, while on Columbia it trails SimCLR but outperforms the remaining baselines. %  We suspect the reason is small size of Columbia dataset which effects fine-tuning the entire network.  
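
\noindent For reference, the reported statistics can be sketched as follows, assuming the standard angular error used in the gaze estimation literature. The symbols $\hat{g}_n$, $g_n$, $N$, $e^{(r)}$, $\mu$, and $\sigma$ are introduced here for illustration only, and whether the population or the sample form of the standard deviation is used is not stated, so the population form shown is an assumption:
\begin{equation*}
\begin{split}
    e^{(r)} &= \frac{1}{N} \sum_{n=1}^{N} \arccos\!\left( \frac{\hat{g}_n^{\top} g_n}{\lVert \hat{g}_n \rVert_2 \, \lVert g_n \rVert_2} \right), \\
    \mu &= \frac{1}{10} \sum_{r=1}^{10} e^{(r)}, \qquad
    \sigma = \sqrt{\frac{1}{10} \sum_{r=1}^{10} \bigl( e^{(r)} - \mu \bigr)^{2}},
\end{split}
\end{equation*}
where $\hat{g}_n$ and $g_n$ are the predicted and ground-truth gaze directions of the $n$-th evaluation sample in calibration run $r$, and each table entry corresponds to the pair $(\mu, \sigma)$ expressed in degrees.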


\begin{table*}[]
\caption{\textbf{Transfer Learning Evaluation (Finetuning).} Comparison of various baselines under the \textit{Finetuning} experimental protocol for multiple few-shot settings on both MPIIGaze and Columbia. Here, we fine-tune the whole network end-to-end and use a few calibration samples at test time. Column headers give the number of calibration samples. 
The angular errors (in degrees) are computed over 10 runs and reported as (\meanstd{mean}{std}).}
\label{table:cross-data-eval}
\centering
\resizebox{\columnwidth}{!}{%
	\begin{tabular}{l|c|c|c|c|c|c|c} 
	\hline
	   & \multicolumn{7}{c}{\textbf{MPIIGaze}} \\ 
    \hline
	 \textbf{Method}  &  1 & 3 & 5 & 9 & 15 & 50  & 64 \\ 
	\shline  
	w/o Pre-training~\citep{chen2020offset}  & \meanstd{5.57}{1.60} & \meanstd{4.65}{0.71} & \meanstd{4.40}{0.40}  & \meanstd{4.22}{0.27}  & \meanstd{4.13}{0.17} & \meanstd{4.00}{0.04} & \meanstd{4.00}{0.04} \Tstrut{} \\
	
    Autoencoder  & \meanstd{5.65}{1.60} & \meanstd{4.69}{0.76} & \meanstd{4.42}{0.45} & \meanstd{4.16}{0.21} & \meanstd{4.10}{0.16} & \meanstd{3.97}{0.05}  &  \meanstd{3.96}{0.04}\Tstrut \\
    
    Novel View Synthesis~\citep{rhodin2018unsupervised}  & \meanstd{5.53}{1.32} & \meanstd{4.75}{0.63} & \meanstd{4.46}{0.40} & \meanstd{4.27}{0.25} & \meanstd{4.17}{0.15} &  \meanstd{4.06}{0.04} & \meanstd{4.06}{0.04}\Tstrut \\
    
    BYOL~\citep{grill2020bootstrap}   & \meanstd{5.71}{1.63} &  \meanstd{4.71}{0.66}  & \meanstd{4.35}{0.31}  &  \meanstd{4.22}{0.21} & \meanstd{4.11}{0.15}  &    \meanstd{4.01}{0.05} & \meanstd{4.00}{0.04}  \Tstrut\\ 
    
    SIMCLR~\citep{chen2020simple}   & \meanstd{4.87}{1.51} &  \meanstdred{3.93}{0.54} & \meanstd{3.74}{0.35} &  \meanstd{3.57}{0.24} &  \meanstd{3.47}{0.12} &  \meanstd{3.39}{0.04}  &  \meanstd{3.38}{0.03}\Tstrut \\ 
    

    
    \textbf{GazeCLR (Equiv)} & \meanstdblue{4.70}{1.49} & \meanstdblue{3.77}{0.51}   & \meanstdblue{3.51}{0.32}  & \meanstdblue{3.39}{0.18}  & \meanstdblue{3.33}{0.11}  & \meanstdblue{3.25}{0.03}  & \meanstdblue{3.24}{0.02}  \Tstrut  \\ 
    
    \textbf{GazeCLR (Inv+Equiv)} & \meanstdred{4.72}{1.33} &  \meanstd{3.93}{0.54}  &  \meanstdred{3.68}{0.34} &  \meanstdred{3.54}{0.19}  & \meanstdred{3.44}{0.11}  & \meanstdred{3.37}{0.03}    & \meanstdred{3.35}{0.03}\Tstrut \\ 
	\shline
% 	\hline
	   & \multicolumn{7}{c}{\textbf{Columbia}} \\ 
% 	\hline
	\shline \\[-2.6ex]
	
	w/o Pre-training~\citep{chen2020offset}  & \meanstd{6.96}{0.55} & \meanstd{5.73}{0.20}   & \meanstd{5.38}{0.14}  & \meanstd{5.23}{0.09}  & \meanstd{5.13}{0.05}  & \meanstd{5.04}{0.08}  &  \meanstd{5.00}{0.09} \Tstrut\\

    Autoencoder  &  \meanstd{7.00}{0.57}  & \meanstd{5.79}{0.18}   &  \meanstd{5.49}{0.15} &  \meanstd{5.24}{0.07} & \meanstd{5.15}{0.04}  &  \meanstd{5.03}{0.08}  & \meanstd{5.03}{0.07} \Tstrut \\
    
    Novel View Synthesis~\citep{rhodin2018unsupervised}  & \meanstd{7.38}{0.60} &  \meanstd{6.05}{0.22}  & \meanstd{5.78}{0.14}  &  \meanstd{5.51}{0.05} &  \meanstd{5.43}{0.06} &  \meanstd{5.33}{0.06}  &  \meanstd{5.27}{0.08} \Tstrut \\
   
    BYOL~\citep{grill2020bootstrap}   & \meanstd{6.09}{0.41} &   \meanstd{4.97}{0.22} & \meanstd{4.70}{0.13}  & \meanstd{4.55}{0.09}   & \meanstd{4.43}{0.04}  &  \meanstd{4.35}{0.05}   & \meanstd{4.34}{0.06}  \Tstrut \\ 

    SIMCLR~\citep{chen2020simple}   & \meanstdred{4.36}{0.20} &   \meanstdred{3.67}{0.13} & \meanstdred{3.44}{0.07}  &  \meanstdred{3.34}{0.05} &  \meanstdred{3.27}{0.04} &   \meanstdred{3.21}{0.04}  &  \meanstdred{3.19}{0.05} \Tstrut \\ 
    

    \textbf{GazeCLR (Equiv)} & \meanstdblue{4.34}{0.25} &  \meanstdblue{3.60}{0.12}  &  \meanstdblue{3.42}{0.09} &  \meanstdblue{3.30}{0.04} &  \meanstdblue{3.26}{0.02} & \meanstdblue{3.17}{0.04}  & \meanstdblue{3.17}{0.02}  \Tstrut\\ 
    

    \textbf{GazeCLR (Inv+Equiv)} & \meanstd{4.54}{0.24} &  \meanstd{3.75}{0.12}  & \meanstd{3.59}{0.08}  &  \meanstd{3.45}{0.05}  &  \meanstd{3.39}{0.03} &  \meanstd{3.31}{0.04}   & \meanstd{3.31}{0.04}  \Tstrut\\ 
    \shline
	\end{tabular}
	}
\end{table*}


\section{Ablation Studies}
\label{sec:ablations}

\subsection{Increasing number of views improves pre-training} Table~\ref{tab:moreviews} shows the effect of increasing the number of views used in the pre-training stage of \gazeclr{}. For this ablation study, we conduct experiments in the cross-dataset setting under linear layer training (LLT; similar to Fig. 3) and in the within-dataset setting (similar to Table 1), shown in Table~\ref{tab:moreviews}(a) and Table~\ref{tab:moreviews}(b), respectively. For 2 views, we use the center and right cameras; for 3 views, the left camera is added. In the LLT setting, the gap between pre-training with $2$ or $3$ views and with all $4$ views is largest for the smallest number of shots: at $k=20$, using all $4$ views gives the lowest error on both datasets, whereas at $k=50$ and $k=64$ the $3$-view and $4$-view models perform comparably. This shows that more views are most helpful for \gazeclr{} when $k$ is small. Similarly, in the within-dataset setting, the error increases from $4.83^{\circ}$ with $4$ views to $7.06^{\circ}$ and $7.72^{\circ}$ with $3$ and $2$ views, respectively.


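\subsection{More data, better pre-training} In Table~\ref{tab:ablation}(a), we study the impact of the amount of unlabeled data used in the pre-training stage of the \gazeclr{} framework. Pre-training on EVE instead of the smaller MiniEVE reduces the 20-shot error from $11.25^{\circ}$ to $8.16^{\circ}$ on MPIIGaze and from $9.63^{\circ}$ to $6.80^{\circ}$ on Columbia. The representations learned by \gazeclr{} thus benefit from more training data, which helps them generalize across datasets from different domains.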

\subsection{Larger batch-size is useful} Next, we vary the batch size to analyze its effect on pre-training; the results are shown in Table~\ref{tab:ablation}(b). A larger batch size considerably improves the quality of the representations: increasing the batch size from $32$ to $128$ reduces the error from $12.21^{\circ}$ to $8.16^{\circ}$ on MPIIGaze and from $12.83^{\circ}$ to $6.80^{\circ}$ on Columbia. This observation is consistent with previous findings in the self-supervised learning literature~\citep{chen2020simple, he2020momentum}.



\begin{table}[!ht]
  \centering
%   \resizebox{\columnwidth}{!}{%
 \caption{\textbf{Ablation on increasing number of views.} Within-dataset and cross-dataset (LLT) evaluation with an increasing number of views used in the pre-training stage of \gazeclr{} on both MPIIGaze and Columbia. The ablation study is performed for the \gazeclr\textit{(Equiv)} method, and the evaluation metric is mean angular error (MAE) in degrees, averaged over 10 runs.}
\label{tab:moreviews}
  \subfloat[LLT Cross-dataset evaluation]{
    \footnotesize
    \centering
    \begin{tabular}{l|c|c|c|c}\hline
      \hline
	        \textbf{Dataset} & \textbf{\# of views} & \textbf{$k=20$} & \textbf{$k=50$} & \textbf{$k=64$} \Tstrut{} \Bstrut{}\\
            \shline  
            MPIIGaze & 2 & \meanstdmean{8.94}{1.23} & \meanstdmean{7.59}{1.46} & \meanstdmean{7.25}{1.41}\\
	        Columbia & 2 & \meanstdmean{7.63}{0.77} & \meanstdmean{4.58}{0.48} & \meanstdmean{4.02}{0.51} \\
	        \hline
	        MPIIGaze & 3 & \meanstdmean{8.38}{1.06} & \meanstdmean{7.09}{1.25} & \meanstdmean{6.78}{1.29} \\
	        Columbia & 3 & \meanstdmean{7.20}{0.68} & \meanstdmean{4.45}{0.50} & \meanstdmean{3.88}{0.45} \\
	        \hline
	        MPIIGaze & 4 & \meanstdmean{8.16}{1.06} & \meanstdmean{7.15}{1.25} & \meanstdmean{6.85}{1.29} \\
	        Columbia & 4 & \meanstdmean{6.80}{0.68} & \meanstdmean{4.46}{0.50} & \meanstdmean{3.90}{0.45} \\
            \shline  
    \end{tabular}
  }
  \subfloat[Within-dataset evaluation]{
    \small
    \centering
    \begin{tabular}{c|c}\hline
     \hline
	        \textbf{\# of views} &\textbf{ MAE (degrees)} \Tstrut{} \Bstrut{}\\
            \shline  
        	2 & 7.72 \\
        	3 & 7.06 \\
        	4 & 4.83 \\
            \shline  
    \end{tabular}
  }
%   }
\end{table}


\begin{table*}
\centering
\caption{\textbf{Ablation Study.} 20-shot \textit{linear layer training} for cross-dataset gaze estimation on MPIIGaze and Columbia under two different ablation settings. Ablations are performed for the \gazeclr\textit{(Equiv)} method, and the evaluation metric is mean angular error (MAE) in degrees.}
\label{tab:ablation}
  \subfloat[Varying amount of pre-training data]{
    \small
    \centering
        \begin{tabular}{c|c|c}
            \hline
            Pre-Train Data & MPIIGaze & Columbia \Tstrut{} \Bstrut{}\\
            \shline  
            MiniEVE & 11.25 & 9.63 \Tstrut{} \Bstrut{}\\
            EVE & \textbf{8.16} & \textbf{6.80} \Tstrut{} \Bstrut{}\\
            \shline  
        \end{tabular}}
    \hfill
    \subfloat[Varying batch-size used for pre-training]{
    \small
    \centering
        \begin{tabular}{c|c|c}
            \hline
            Batch size & MPIIGaze & Columbia \Tstrut{} \Bstrut{} \\
            \shline  
            32 & 12.21 & 12.83 \Tstrut{}\\
            128 & \textbf{8.16} & \textbf{6.80} \Tstrut{} \Bstrut{}\\
            \shline  
        \end{tabular}}
\end{table*}

\begin{table*}[h]
\caption{\textbf{Ablation Study for mini-batch containing single \textit{vs.} multiple participants.} Within-dataset evaluation for two different types of batches created for the \gazeclr\textit{(Equiv)} method. The evaluation metric is mean angular error (MAE) in degrees.}
\label{ablation:singleidentity}
  \centering
    \begin{tabular}{c|c|c}
    \hline
    Task Data & Batch Type & MAE (degrees) \Tstrut{} \Bstrut{}\\
    \shline  
    MiniEVE & Single & \textbf{4.83} \Tstrut{} \Bstrut{}\\
    MiniEVE & Multiple & 23.58 \Tstrut{} \Bstrut{}\\
    \shline  
    \end{tabular}
\end{table*}
    

\subsection{Mini-batch of single \textit{vs.} multiple participants} In Table~\ref{ablation:singleidentity}, we compare mini-batches created from the samples of a single subject with mini-batches containing samples from multiple subjects, under within-dataset evaluation (similar to Table 1). With multi-subject batches, the error rises from $4.83^{\circ}$ to $23.58^{\circ}$, which is close to the performance of random weights. We hypothesize that this is because, in batches with different subjects, negative pairs are easy to distinguish based on the subject's identity alone. Therefore, the network has no incentive to focus on gaze information over subject identity.
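
\noindent To make this hypothesis concrete, consider a generic InfoNCE-style objective. This is only a sketch: the exact form of $\mathscr{L}^{E}$ is the one defined in the main text, and the symbols $\ell_i$, $z_i$, $z_i^{+}$, $\mathcal{N}_i$, $\mathrm{sim}(\cdot,\cdot)$, and $\tau$ are assumptions of this illustration:
\begin{equation*}
    \ell_i = -\log \frac{\exp\bigl(\mathrm{sim}(z_i, z_i^{+})/\tau\bigr)}{\exp\bigl(\mathrm{sim}(z_i, z_i^{+})/\tau\bigr) + \sum_{k \in \mathcal{N}_i} \exp\bigl(\mathrm{sim}(z_i, z_k)/\tau\bigr)},
\end{equation*}
where $z_i^{+}$ is the positive for anchor $z_i$, $\mathcal{N}_i$ indexes the in-batch negatives, and $\tau$ is a temperature. If the negatives in $\mathcal{N}_i$ come from other participants, the terms $\mathrm{sim}(z_i, z_k)$ can be driven down using identity cues alone, so $\ell_i$ can become small without encoding gaze. When all samples in a batch share one participant, identity no longer separates positives from negatives.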


\section{Supervised Fine-tuning for Gaze Estimation}
In the main manuscript, we demonstrated that the self-supervised gaze representations learned using \gazeclr{} perform well in a variety of settings when fine-tuned on the target dataset. Here, we investigate how performance varies with the amount of data available for fine-tuning. We evaluate within-dataset gaze estimation using the \textit{linear layer training} protocol, starting from $10\%$ of the EVE training dataset and gradually increasing to $100\%$. We compare \gazeclr(\textit{Equiv}) and \gazeclr(\textit{Inv}+\textit{Equiv}) against the ``w/o Pre-training'' baseline with random initialization, as shown in Figure~\ref{fig:differentperc}. \gazeclr{} outperforms the baseline for all training set sizes. It is worth noting that \gazeclr{} requires only $20\%$ of the training data to match the performance of the ``w/o Pre-training'' baseline trained with $100\%$. Furthermore, the gap between \gazeclr{} and the baseline narrows as the training set grows, showing that \gazeclr{} is most beneficial when only a few labeled samples are available.


\begin{figure}[h]
    \centering
    \includegraphics[width=0.7\columnwidth]{images/different_perc.png}
    \caption{Comparison of within-dataset gaze estimation performance using the \textit{Linear Layer Training} protocol for different percentages of the labeled training data.}
    \label{fig:differentperc}
\end{figure}


\section{Implementation Details for Baseline Methods}
\label{appendix:baselines}

We provide further details of our implementation for the pre-training baselines, namely, Autoencoder and Novel View Synthesis~\citep{rhodin2018unsupervised}.


\paragraph{Autoencoder.} We use the same encoder layers as the \gazeclr{} framework for a fair comparison. The decoder is implemented using the DenseNet~\citep{huang2017densely} architecture, with convolutional layers replaced by deconvolutional layers of stride 1. The average pooling layer of each transition layer is replaced by a $3\times 3$ deconvolution (with stride 2). The decoder consists of 5 dense blocks, where each block has 4 composite layers with a growth rate of 32. The compression factor is set to 1.0. All layers use instance normalization~\citep{ulyanov2016instance} and leaky ReLU activation functions (with $\alpha=0.01$). We use the SGD optimizer with momentum $0.9$, weight decay $5\times 10^{-4}$, and an initial learning rate of $0.003$, which is decayed using a cosine annealing scheduler~\citep{loshchilov2016sgdr}. The batch size is $24$ and the model is trained for $200$K iterations. For inference, we remove the decoder layers and use only the encoder for the task of gaze estimation.
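
\noindent As a sketch of the learning-rate schedule, assume the cosine annealing rule of \citet{loshchilov2016sgdr} without warm restarts and with a minimum learning rate $\eta_{\min} = 0$ (neither choice is stated above); $\eta_t$, $\eta_0$, $t$, and $T$ are notation introduced for this illustration:
\begin{equation*}
    \eta_t = \eta_{\min} + \frac{1}{2} \left( \eta_0 - \eta_{\min} \right) \left( 1 + \cos\!\left( \frac{\pi t}{T} \right) \right).
\end{equation*}
With $\eta_0 = 0.003$ and $T = 200$K iterations, this gives $\eta_t = 0.003$ at $t = 0$, $\eta_t = 0.0015$ at $t = 100$K, and $\eta_t = 0$ at $t = T$.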

\paragraph{Novel View Synthesis~\citep{rhodin2018unsupervised}.} This work was originally proposed for the 3D human pose estimation task and learns novel view synthesis, with separate representations for the body's 3D geometry ($\mathbf{L}^{\text{3D}}$), appearance ($\mathbf{L}^{\text{app}}$), and background ($\mathbf{B}$). For a fair comparison, we train the novel view synthesis framework on our dataset using the same encoder architecture as in the \gazeclr{} framework. The decoder layers are the same as those of the autoencoder baseline. The dimension of the appearance code ($\mathbf{L}^{\text{app}}$) is $32$ and that of the 3D geometry code ($\mathbf{L}^{\text{3D}}$) is $480$. We ignore the background factor ($\mathbf{B}$) in our implementation, as the EVE dataset has the same background across all images. The whole framework is trained using the SGD optimizer with learning rate $=0.03$, momentum $=0.9$, weight decay $=5\times 10^{-4}$, and cosine annealing for learning rate decay. Training runs for $200$K iterations with a batch size of $16$. At each iteration, we randomly sample two views from the EVE dataset and generate the image of one view from the image of the other, similar to \citet{rhodin2018unsupervised}. The trained encoder is then adapted for gaze estimation, as for the other baselines.


\section{Additional Visualization}
We further qualitatively analyze the relation between the learned gaze representations and the ground-truth 2D Point-of-Gaze (PoG). To this end, we project the gaze representations to a 2-dimensional space using the t-SNE~\citep{van2008visualizing} algorithm and normalize them to lie between 0 and 1. Next, we plot the Euclidean distance of the 2D t-SNE projections against that of the normalized 2D PoG (obtained by dividing by the width and height of the screen), as shown in Figure~\ref{fig:corr}. The black line in Figure~\ref{fig:corr} shows $y=x$. We observe that the data are scattered symmetrically around $y=x$, exhibiting a clear positive correlation (correlation coefficient $= 0.623$) between the gaze representations and the ground-truth PoG.
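
\noindent One possible formalization of this plot, assuming that distances are computed between pairs of samples $(i, j)$ and that the reported coefficient is Pearson's correlation (neither is stated explicitly above; $p_i$, $t_i$, $d^{\text{PoG}}_{ij}$, $d^{\text{emb}}_{ij}$, and $\rho$ are notation introduced for this sketch):
\begin{equation*}
\begin{split}
    d^{\text{PoG}}_{ij} &= \lVert p_i - p_j \rVert_2, \qquad d^{\text{emb}}_{ij} = \lVert t_i - t_j \rVert_2, \\
    \rho &= \frac{\operatorname{cov}\bigl(d^{\text{PoG}}, d^{\text{emb}}\bigr)}{\sigma_{d^{\text{PoG}}} \, \sigma_{d^{\text{emb}}}},
\end{split}
\end{equation*}
where $p_i \in [0,1]^2$ is the normalized PoG of sample $i$ and $t_i \in [0,1]^2$ is its normalized t-SNE projection. Under this reading, points near the line $y=x$ correspond to sample pairs whose distance in the embedding matches their distance on the screen.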

\begin{figure}[t]
    \centering
    \includegraphics[scale=0.55]{images/corr.png}
    \caption{Scatter plot of the Euclidean distance of the normalized 2D PoG against that of the 2D t-SNE projections of gaze representations. The black line shows $y=x$.}
    \label{fig:corr}
\end{figure}


% \section{Broader Impact}
% The proposed work presents an unsupervised framework for learning gaze representations and is trained using multi-view data. Our method improves the performance of supervised gaze estimation models for several datasets from different domains. As a result, this work is relevant to various applications that require gaze information, e.g., human-computer interaction, behavioral studies, or medical research. Furthermore, as our work focuses on improving gaze recognition performance, it may have an indirect negative impact if gaze recognition systems are deployed in a harmful manner. However, we cannot exclude the possibility that our method can be used in improving gaze recognition systems and helpful for various applications.

% The proposed work presents an unsupervised framework for learning gaze representations and is trained using multi-view data. Our method  improves the performance of supervised gaze estimation models for several datasets arising from different domains. 
% As a result, this work is relevant to various applications that require gaze direction, e.g., human-computer interaction, behavioral studies, or medical research. Furthermore, this work focuses on improving gaze estimation performance, it may have indirect negative impact if gaze recognition systems are employed in a harmful manner. 3D Gaze tracking is used in many applications and thus, we cannot exclude the possibility that our method may be used in improving gaze recognition systems for such applications. 

%TOur work can plausibly be applied in negative way as the automation of 3D gaze tracking on society; we encourage the research community to be cautious before adopting our method for their particular application.

%The proposed work presents a contrastive learning framework for gaze representations and is trained in an unsupervised manner using multi-view data. Our method can be helpful in improving the performance of gaze estimation models for different domain datasets and thus is relevant to various gaze-based applications such as Human-Computer Interaction (HCI), behavioral studies, or medical research. However, we recognize the plausible negative impacts of automation of 3D gaze tracking on society; we encourage the research community to be cautious before adopting our method for their particular application.
