\section{Introduction}
\label{sec:introduction}


%Gaze represents the focus of human attention and serves as an essential cue for non-verbal communication. While specialized gaze trackers can accurately measure a user's gaze direction, there is substantial interest in gaze estimation using regular cameras. Although, learning gaze estimation models from images is challenging and needs to transcend multiple ``nuisance" characteristics such as facial features or head orientation to produce the accurate gaze direction.

Gaze represents the focus of human attention and serves as an essential cue for non-verbal communication. While specialized gaze trackers can accurately measure a user's gaze direction, there is substantial interest in gaze estimation using regular cameras. However, learning gaze estimation models from images is challenging: a model must see past multiple ``nuisance'' attributes, such as facial features or head orientation, to estimate gaze accurately.


%In recent years, promising results have been obtained using deep learning~\cite{zhang2015appearance, zhang2017s, krafka2016eye}. 
In recent years, deep learning~\citep{zhang2015appearance, zhang2017s, krafka2016eye} has shown promising results for gaze estimation. 
In part, this success stems from the availability of large-scale annotated datasets. To be useful, such datasets must cover a wide range of gaze directions, appearances, and head poses, and collecting them is a laborious and time-consuming procedure. 
Also, gaze annotations are difficult to obtain, which makes the creation of large, representative datasets challenging~\citep{ghosh2021automatic}. %Ground-truth gaze labeling can be obtained either by using a dedicated gaze tracking device, which only affords a  limited range of distances and gaze angles; or by asking participants to fixate specific locations in space or on a screen one at a time, which results in laborious and time-consuming procedures. 
Therefore, methods that facilitate training with limited gaze annotations are highly desirable.

%However, collecting dataset that covers the whole range of gaze directions, appearances, and head poses, is highly expensive and cumbersome~\cite{ghosh2021automatic}. %Therefore, data collection is the bottleneck for training accurate gaze estimators.

\begin{figure}[t]%
    \centering
    \subfloat[\centering Two stages of our framework]{{\includegraphics[width=0.45\textwidth]{images/overall2.pdf} }}%
    \qquad
    \subfloat[\centering Formation of positive pairs ]{{\includegraphics[width=0.45\textwidth, height=3.5cm]{images/overall1.pdf} }}%
    \caption{\textbf{Overall idea.}  (a) The proposed two-stage learning framework for gaze estimation. Stage-I shows the Gaze Contrastive Learning (\gazeclr{}) framework, which is trained using only unlabeled data and learns both \textit{invariance} and \textit{equivariance} properties. In Stage-II, the pre-trained encoder is employed for the gaze estimation task with a small amount of labeled data. (b) Two images (shown in \textcolor{red}{\textbf{red}} and \textcolor{green}{\textbf{green}}) captured at the same time from different camera views are used to create both invariant and equivariant positive pairs.}%
    \label{fig:overallidea}%
\end{figure}


Self-supervised learning (SSL) has achieved tremendous success over the past few years and has emerged as a powerful tool for reducing over-reliance on human annotations~\citep{he2020momentum, chen2020simple, grill2020bootstrap}. Following a generally accepted paradigm, we consider a pre-training stage that requires no labels, followed by a fine-tuning stage that uses a relatively small number of labeled samples. 
%Particularly, SSL methods are part of the recent paradigm that works in two stages, namely, pre-training and supervised fine-tuning. 
SSL is an effective approach for pre-training, in which semantically meaningful representations are learned that can be seamlessly adapted during the
%quickly adapted for 
%the final task at hand
fine-tuning stage~\citep{caron2018deep, crawford2019spatially,moriya2018unsupervised}. 
% The hope is that, through SSL pre-training, the network may learn to generate semantically meaningful information for the final task. 
Specifically, good pre-training would ensure that the embeddings of images associated with the same gaze direction are neighbors in the feature space, regardless of irrelevant factors such as appearance. Arguably, this could ease fine-tuning, possibly reducing the number of labeled samples required. % needed for training a highly accurate gaze estimation model.
%Presumably, if representations from the pre-training stage embed semantically meaningful information for the final task (e.g., relative gaze directions for gaze estimation) then even the smaller number of labeled images are sufficient to train highly accurate models. % for the final task and in  to close the gap with supervised learning performance. 
%A popular approach for SSL pre-training is \textit{contrastive representation learning}
% i.e., to embed images such that representations of positive pairs are closer to each other and negative pairs are farther away \addcite. 


In this work, for SSL pre-training, we focus on \textit{contrastive representation learning} (CRL), which aims to map ``positive'' sample pairs to embeddings that are close to each other, while mapping ``negative'' pairs to embeddings that are far apart~\citep{chopra2005learning}.
%CRL requires defining large enough sets of positive and negative sample pairs. 
A popular approach is to apply two different transformations (or augmentations) to an input image, with the two transformed versions
% with the original and augmented image 
forming a positive pair and versions of different images forming negative pairs. This method encourages representations to be invariant to these types of transformations, which are assumed to model ``nuisance'' effects. 
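
For concreteness, one widely used contrastive objective is the normalized temperature-scaled cross-entropy loss of SimCLR~\citep{chen2020simple}. It is shown here only as an illustration of the general idea, and the objective used by \gazeclr{} may differ. The notation is an assumption introduced for this sketch: $\mathbf{z}_i$ and $\mathbf{z}_j$ are the embeddings of a positive pair within a batch of $2N$ transformed images, $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, and $\tau > 0$ is a temperature. The loss for the positive pair $(i, j)$ is
\[
\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_j)/\tau\right)}{\sum_{k=1,\, k \neq i}^{2N} \exp\left(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_k)/\tau\right)} ,
\]
which decreases as the positive pair becomes more similar and as $\mathbf{z}_i$ becomes less similar to the remaining embeddings in the batch.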

However, determining a necessary and sufficient set of ``positive'' and ``negative'' pairs for a given task remains a non-trivial open problem. This work attempts to address this problem for gaze estimation. % task.  %However, invariance may be too much to ask in many scenarios, e,g., invariance to head pose.
%Two images of the same person looking in the same direction, but with a different head orientation, may look very different from each other. It may be unreasonable to expect that the network would generate similar embeddings for  these two images, yet generate a very different embedding for another image  with similar head pose but different gaze direction. 
%Consider, two images with same gaze direction and with a different head pose, may look very different from each other. Therefore, it's unreasonable to learn similar embeddings for these two images, yet learn a very different embedding for another image with similar head pose but different gaze direction. 
%{\bf (Here you could show a figure with three images of the same person, taken by the same camera. The first and second image look at the same direction but with different head pose. The third image has the same head pose as the second one, but looks at a different direction.} 
%Recent CRL-based methods discover ways to encourage the representations to be 
Recent CRL-based methods encourage the representations to be invariant to \textit{any} image transformation, yet many such transformations are not suitable for gaze estimation. For example, geometry-based image transformations (such as rotation) change the gaze direction. In contrast, invariance to appearance, e.g., a person's identity or the background, is beneficial. %Furthermore, we emphasize that the equivariance relation between the geometry-based transformations and gaze direction can be leveraged for CRL. 

% While many controlled image transformations ({\em augmentations}) are not applicable for gaze direction, {\em equivariance} could serve as another way to build pairs for CRL. Equivariance means that the embedding of an input image changes in a predictable way as the image undergoes a certain parametric transformation. Consider for example head rotation, which can be parameterized by the three degrees of freedom of a 3-D rotation matrix $R$. Park et al.~\cite{park2019few} showed that it is possible to create an approximately equivariant mapping, such that, given two images of the same person, where the second image is taken after their head rotated by $R$, the associated embeddings (of size $3\times N$) are related to one another by pre-multiplication by the same rotation matrix $R$. Thus, provided that the person's head rotation $R$ can be estimated from the image, one could achieve invariance {\em post facto} from the equivariant embeddings.

% The main contribution of this paper 

%In this paper, we propose \textit{Gaze Contrastive Learning} (or \gazeclr{}) framework -- a CRL-based unsupervised pre-training approach, i.e., we never use gaze direction labels during pre-training. 
In this paper, we propose the \textit{Gaze Contrastive Learning} (\gazeclr{}) framework, a simple CRL-based unsupervised pre-training approach for gaze estimation, i.e., a pre-training
method that requires no gaze labels. 
Specifically, our approach relies on \textit{invariance} to image transformations that do not alter gaze direction (e.g., color jitter) and on \textit{equivariance} to camera viewpoint. The latter requires multi-view geometry information, i.e., images of the same person captured at the same time by two or more cameras at different locations. 

To learn \textit{equivariance}, we leverage the fact that, in {\em a common reference system}, two or more synchronous images of the same person taken from different camera viewpoints are associated with the same gaze direction. Knowing the pose of each camera relative to this \textit{common reference system} gives the relation between the gaze directions expressed in the respective camera coordinate systems.
% The knowledge of the relative camera poses (which can be obtained through standard geometric calibration) provides the rotation matrices relating the gaze directions defined in respective camera reference system. 
In other words, gaze direction has an equivariant relationship to camera viewpoints. We claim that the requirement of using multiple cameras may be less onerous than obtaining gaze annotations for each image. 
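
As an illustration of this relation, consider the following sketch. The symbols are assumptions introduced here and may differ from the notation used in the rest of the paper: $\mathbf{g}$ is the unit gaze direction in the common reference system, $\mathbf{R}_i$ is the rotation from the common reference system to the coordinate system of camera $i$, and $\mathbf{g}_i$ is the same gaze direction expressed in the coordinate system of camera $i$. Then
\[
\mathbf{g}_i = \mathbf{R}_i\,\mathbf{g}, \qquad \mathbf{g}_2 = \mathbf{R}_2 \mathbf{R}_1^{\top} \mathbf{g}_1 ,
\]
where the second identity follows from $\mathbf{g} = \mathbf{R}_1^{\top}\mathbf{g}_1$. The gaze directions seen by two synchronized cameras therefore differ only by the known relative rotation $\mathbf{R}_2 \mathbf{R}_1^{\top}$; since a gaze direction is a direction vector, the translation between the cameras plays no role in this relation.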


We use the existing multi-view gaze dataset EVE~\citep{park2020towards}, which provides video sequences captured from four calibrated and synchronized cameras together with gaze annotations obtained using a gaze tracking device~\citep{TobiiProLab}.
% {\bf (Maybe you could say something about this data set.)} 
We ignore the gaze labels during pre-training and use them only for fine-tuning and evaluation. Note that the relative camera pose information available with the EVE dataset is used \textit{only} during the pre-training stage. Figure~\ref{fig:overallidea} presents an overview of the proposed idea. 


%Albeit contrastive representation learning is elegant, it remains non-trivial to create necessary and sufficient enough number of positive and negative pairs, for a given task. %, such as, f 3-D regression tasks such as gaze estimation. For instance, many augmentations from  
%versions of the same image, whereas a negative pair is made up of two separate images. 
% Two distorted versions of the same image are created to form a positive pair, whereas negative pair is chosen from different image. 
% However, these methods discover ways to encourage the learned representations to be invariant to any image transformation, which is not always relevant for 3-D regression tasks such as gaze estimation. For gaze estimation task, it is beneficial to have invariance to appearance transformation such as subject's identity, background, etc. Despite this, the task requires equivariance under 3D geometry-based transformations like rotation. 

% Previously, Park et al.~\cite{park2019few} proposed a representation learning method to discover rotation-equivariant mappings for gaze estimation, although their method utilizes a fully-supervised pre-training approach. Unsupervised methods~\cite{yu2020unsupervised, sun2021cross} have been proposed to learn gaze-specific representations without annotations. However, unlike these methods, our proposed framework does not require a generator stage and thus is computationally cheaper~\cite{liu2021self} to train. Furthermore, our method exploits images captured from multiple cameras from various viewpoints and learns pre-training without gaze annotations. The collection of images captured from multi-view camera is more economical and faster than acquiring gaze labels, which requires special gaze tracking hardware~\cite{TobiiProLab}. To the best of our knowledge, we are first to investigate contrastive-based self-supervised learning for the gaze estimation task. In our paper, we use the terms self-supervised and contrastive learning interchangeably.
% \my{I think this para is great but should me moved up before the para in which we mention our method.}


% \my{this is TMI for introduction, make it precise and also cite more often; also, motivation wrt prior art is missing. we must clarify the problem we are solving, i.e., exploring (less-label intensive) SSL + fine-tuning as oppose to FAZE or other representation learning framework.}

% This is because gaze annotations changes as we apply geometric transformations at the input image such as rotation or translation. Therefore, in conjunction with invariance, we need to encourage equivariance within representations to various affine transformations.





% With the development of recent multi-view gaze datasets~\cite{park2020towards, zhang2020eth}, 

%To this end, we propose a self-supervised learning approach to learn rotation-aware latent representations \textit{without} using any gaze annotations and leverage recent multi-view gaze data~\cite{park2020towards}. 

%To elucidate further, the key intuition of \gazeclr{} is that synchronous images of the same person captured from multiple viewpoints serve us various SSL ingredients. Firstly, two different augmented versions of a randomly sampled frame can be used to learn invariance to appearance transformation, similarly to previous SSL approaches~\cite{chen2020simple}. Secondly, two frames of the same person, sampled from the same timestamp but from different views, are used to learn equivariance under different geometric transformations. 
%This is because the gaze of these two frames has a known relationship given by the relative pose of two cameras (which is available along with the dataset). 
% Therefore, our approach enjoys the complementary strengths of both invariance and equivariance for learning gaze-specific representations and thus, termed as \textit{Gaze Contrastive Learning(\gazeclr{})}. 

% The key idea is that the multi-view video sequence data serves us multiple ingredients for contrastive learning \my{not clear}. Firstly, two different augmented versions of a randomly sampled frame can be used to learn invariance to appearance transformations, similar to previous SSL approaches. 
% % the temporal neighbor frames are treated as positive pairs, assuming that the gaze direction does not change significantly over a short period of time.  
% Secondly, two frames of same subject sampled from same timestamp but different views are useful for learning equivariant representations under different geometry transformations. This is because the gaze of these two frames have a known relationship given by the relative pose \my{pose or orientation} of two cameras (which is available along with the dataset). 
% Additionally, since the positive pairs are formed using video sequences of the same subject, cross-view correspondence encourages learning equivariance to viewpoint and invariance to appearance.

To evaluate \gazeclr{}, we perform self-supervised pre-training on the EVE dataset and transfer the learned representations to the gaze estimation task in various evaluation settings. We demonstrate the effectiveness of the representations by showing that the proposed method improves performance in both within-dataset and cross-dataset evaluations (the latter on datasets such as MPIIGaze~\citep{zhang2017s} and Columbia~\citep{smith2013gaze}), using only a small number of labeled samples for fine-tuning.
%small labeled data yields improved performance on within-dataset gaze estimation task, under all evaluation settings. Furthermore, we exhibit improvements on cross-domain datasets: MPIIGaze~\cite{zhang2017s} and Columbia~\cite{smith2013gaze}, with few calibration samples, demonstrating the generalizability of our approach. 
Our major contributions are summarized as follows:
\begin{enumerate}
    \item We propose a simple contrastive learning method for gaze estimation that relies on the \textit{observation} that gaze direction is \textit{invariant} under selected appearance transformations and \textit{equivariant} under a change between any two camera viewpoints.
    \item We argue for learning equivariant representations from multi-view data, which can be readily collected using multiple cameras.
    \item Our empirical evaluations show that \gazeclr{} yields improvements for various settings of gaze estimation and is competitive with existing supervised~\citep{park2019few} and unsupervised state-of-the-art gaze representation learning methods~\citep{yu2020unsupervised, sun2021cross}.
\end{enumerate}