\section{Related Work}
\paragraph{Gaze Estimation.} Gaze estimation methods are trained on large-scale datasets annotated with either 2D gaze targets~\citep{krafka2016eye, huang2017tabletgaze} or 3D gaze directions~\citep{fischer2018rt, funes2014eyediap, zhang2015appearance}. Broadly, these methods fall into two categories: appearance-based methods~\citep{tan2002appearance}, which directly map image pixels to a 3D gaze direction, and model-based methods, which rely on the geometry of the eye~\citep{hansen2009eye}. Appearance-based methods generally outperform traditional model-based methods in real-world settings~\citep{hansen2009eye, zhang2015appearance}.

Recent progress in appearance-based methods relies heavily on %deep (CNN-based) neural networks 
deep learning to map eye/face images to gaze directions~\citep{zhang2015appearance, zhang2017s}. A few methods are hybrid. For instance, \citet{park2018learning} extract eye landmarks from the images and then use these features to train gaze estimators. Beyond eye images alone, several works~\citep{krafka2016eye, cheng2020gaze, FischerECCV2018, cheng2020coarse} exploit both eye and face images to compute gaze direction. Nevertheless, both appearance-based methods and their hybrid extensions require a large amount of labeled data to reach their full accuracy.

Because appearance-based methods depend so heavily on labels, efforts have been made in the direction of few-shot gaze estimation. In particular, \citet{liu2018differential} use a two-branch network to predict the differential gaze between two images and rely on a few calibration samples during inference. Furthermore, \citet{park2019few} and \citet{zheng2020self} disentangle gaze from other nuisance factors by training an encoder-decoder architecture~\citep{hinton2011transforming} to learn gaze-specific representations. Recent approaches~\citep{wang2022contrastive, bao2022generalizing, jindal2021cuda} leverage labeled source-domain data and unsupervised domain adaptation to improve gaze estimation performance on the target domain.
% and use %learned representation for gaze estimation through
% a meta-learning~\cite{finn2017model} algorithm.
%After that, a few works~\cite{chen2020offset, chen2020geddnet} learn subject-specific bias factors and rely on a few calibration samples at test time. 
In a similar spirit, our work attempts to reduce the amount of label information required by learning gaze representations from multi-view data~\citep{park2020towards, zhang2020eth} without relying on gaze labels.

\paragraph{Self-Supervised Learning.} The goal of self-supervised learning (SSL) is to learn good visual representations from a large collection of unlabeled images. Earlier works in SSL~\citep{zhang2016colorful, noroozi2016unsupervised, noroozi2017representation, Doersch_2015_ICCV} used pretext tasks to learn generalizable semantic representations. More recent works~\citep{misra2020self, he2020momentum, chen2020simple, chen2020improved, caron2020unsupervised, chen2021exploring, grill2020bootstrap} have shown great success on several vision tasks, e.g., image classification~\citep{caron2018deep, dangovski2022equivariant}, object detection~\citep{crawford2019spatially}, semantic segmentation~\citep{moriya2018unsupervised}, and pose estimation~\citep{rhodin2018unsupervised}. \citet{spurr2021self} extend SSL to hand pose estimation by learning geometrically equivariant representations. \citet{tian2020contrastive} propose using more than two views to learn invariant representations through contrastive learning.
% However, to our knowledge, contrastive representation learning has not yet been explored for the gaze estimation task.

Recently, a few unsupervised %learning 
methods have been proposed to learn gaze-specific representations. Specifically, \citet{yu2020unsupervised} exploit the gaze redirection task to train a gaze estimation model using paired eye images of the same subject. Similarly,~\citet{sun2021cross} propose a cross-encoder method that uses patches of the left and right eye images of the same subject %in one frame 
as the self-supervised signal. %to disentangle appearance and gaze information.
\citet{gideon2022unsupervised} extend the approach of \citet{sun2021cross} by utilizing multi-view images and learning features that represent head pose and relative gaze, improving in-domain few-shot gaze estimation performance. However, unlike our work, these methods
employ an encoder-decoder framework and thus require a relatively large number of parameters. %, compared to an encoder-only approach, like ours. 
Moreover, contrastive SSL approaches are computationally more efficient than generative SSL approaches~\citep{liu2021self}.


