\documentclass[pmlr]{jmlr}% new name PMLR (Proceedings of Machine Learning Research)

 % The following packages will be automatically loaded:
 % amsmath, amssymb, natbib, graphicx, url, algorithm2e

 %\usepackage{rotating}% for sideways figures and tables
\usepackage{longtable}% for long tables

 % The booktabs package is used by this sample document
 % (it provides \toprule, \midrule and \bottomrule).
 % Remove the next line if you don't require it.
\usepackage{booktabs}
 % The siunitx package is used by this sample document
 % to align numbers in a column by their decimal point.
 % Remove the next line if you don't require it.
\usepackage[load-configurations=version-1]{siunitx} % newer version
 %\usepackage{siunitx}

\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{graphicx}
\pagenumbering{gobble}
\usepackage{verbatim}
\usepackage{color}
\usepackage{bbding}
\usepackage[flushleft]{threeparttable}
\usepackage{makecell}
\usepackage{enumitem}
\usepackage{multirow}

\usepackage{wrapfig}
\usepackage{xspace}
% \usepackage{subcaption}


\newcommand*{\eg}{e.g.\@\xspace}
\newcommand*{\ie}{i.e.\@\xspace}
\newcommand*{\etal}{et al.\@\xspace}
\newcommand{\figref}[1]{Fig.~\ref{#1}}
\newcommand{\secref}[1]{Sec.~\ref{#1}}
\newcommand{\tabref}[1]{Table~\ref{#1}}

 % The following command is just for this sample document:
\newcommand{\cs}[1]{\texttt{\char`\\#1}}

 % Define an unnumbered theorem just for this sample document:
\theorembodyfont{\upshape}
\theoremheaderfont{\scshape}
\theorempostheader{:}
\theoremsep{\newline}
\newtheorem*{note}{Note}

 % change the arguments, as appropriate, in the following:
\jmlrvolume{1}
\jmlryear{2022}
\jmlrworkshop{NeurIPS 2022 Gaze Meets ML Workshop}

\title[SecNet]{SecNet: Semantic Eye Completion in Implicit Field}

 % Use \Name{Author Name} to specify the name.

 % Spaces are used to separate forenames from the surname so that
 % the surnames can be picked up for the page header and copyright footer.
 
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % *** Make sure there's no spurious space before \nametag ***

 % Two authors with the same address
 % \author{\Name{Author Name1\nametag{\thanks{with a note}}} \Email{abc@sample.com}\and
 %  \Name{Author Name2} \Email{xyz@sample.com}\\
 %  \addr Address}
   
  \author{
  \Name{Yida Wang\nametag{\thanks{Work was done during an internship at Facebook Reality Labs.}}} \Email{yida@fb.com} \\
  % \addr Technische Universität M\"unchen \\
  \Name{Yiru Shen\nametag{\thanks{Work was done in Facebook Reality Labs.}}}  \Email{shenyirustar@gmail.com} \\
   %\addr Cruise Automation \\
  \Name{David Joseph Tan}  \Email{djtan@google.com} \\
  % \addr Google Inc. \\
  \Name{Federico Tombari}  \Email{tombari@in.tum.de}\\
  % \addr Technische Universität M\"unchen \\
  \Name{Sachin Talathi}  \Email{stalathi@fb.com} \\
   %\addr Facebook Reality Labs \\
  }

 % Three or more authors with the same address:
 % \author{\Name{Author Name1} \Email{an1@sample.com}\\
 %  \Name{Author Name2} \Email{an2@sample.com}\\
 %  \Name{Author Name3} \Email{an3@sample.com}\\
 %  \Name{Author Name4} \Email{an4@sample.com}\\
 %  \Name{Author Name5} \Email{an5@sample.com}\\
 %  \Name{Author Name6} \Email{an6@sample.com}\\
 %  \Name{Author Name7} \Email{an7@sample.com}\\
 %  \Name{Author Name8} \Email{an8@sample.com}\\
 %  \Name{Author Name9} \Email{an9@sample.com}\\
 %  \Name{Author Name10} \Email{an10@sample.com}\\
 %  \Name{Author Name11} \Email{an11@sample.com}\\
 %  \Name{Author Name12} \Email{an12@sample.com}\\
 %  \Name{Author Name13} \Email{an13@sample.com}\\
 %  \Name{Author Name14} \Email{an14@sample.com}\\
 %  \addr Address}


 % Authors with different addresses:
 % \author{\Name{Author Name1} \Email{abc@sample.com}\\
 % \addr Address 1
 % \AND
 % \Name{Author Name2} \Email{xyz@sample.com}\\
 % \addr Address 2
 %}

\editor{Editor's name}
 % \editors{List of editors' names}

\begin{document}

\maketitle

\begin{abstract}
Depth images of the eye suffer from noise artifacts and holes, because the specularity of the sclera significantly corrupts the measured depth values. 
This paper addresses this problem through semantic shape completion.
We propose an end-to-end approach to train a neural network, called \emph{SecNet} (semantic eye completion network), that predicts a point cloud with accurate eye geometry together with a semantic label for each point. These labels correspond to the essential eye regions, \ie pupil, iris and sclera.
In particular, our work performs implicit estimation at the query points with semantic labels, where both the semantic and occupancy predictions are trained end-to-end. 
To evaluate the approach, we use synthetic eye scans rendered in the UnityEyes simulator.
Compared to the state of the art, the proposed method improves shape-completion accuracy on 3D eye scans by 8.2\%. 
We also demonstrate the application of our semantic eye completion to gaze estimation.%
\end{abstract}
\begin{keywords}
Eye completion, Implicit Field, Semantic Completion
\end{keywords}

\section{Introduction}

% \maketitle

\begin{wrapfigure}{r}{0.55\linewidth}
\centering
\includegraphics[width=\linewidth]{images/teaser.pdf}
\caption{Given a partial scan of an eye in (a),
our semantic completion in (b) reconstructs the fine-grained eye surface where each point is semantically labeled.}\label{fig:teaser}
\end{wrapfigure} 

Video-oculography (VOG) has gained popularity in recent years as a method for eye tracking~\citep{van2002recording, larrazabal2019video, nair2020rit}. The main elements of VOG are egocentric cameras that capture images of the eye, which then undergo image processing to extract the eye-movement information. 3D VOG systems, on the other hand, also extract the torsional eye position using iris and pupil landmarks~\citep{goni2004robust}. As such, the accuracy of pupil tracking is central to the performance of a VOG system, and it can be significantly hampered by occlusions. 

Methods such as ellipse fitting~\citep{fitzgibbon1999direct}, RANSAC outlier removal~\citep{jian2010two} and moving average filtering~\citep{7886728}, as well as more advanced methods such as circular Hough transforms~\citep{cherabit2012circular} for extreme pupil occlusions~\citep{8528286}, have been found useful for handling pupil occlusion. More recently, several algorithmic approaches that leverage 3D eye structures~\citep{liu2021robust, liu20203d} have been proposed for pupil tracking in the presence of occlusions.

Our work focuses on utilizing the 3D eye region for pupil tracking. We leverage recent advances in 3D machine learning to reconstruct the precise 3D structure of the eye region and fill in the occluded regions. As shown in~\figref{fig:teaser}, shape completion is carried out on the partial scan of the eye to fill in the occluded eye regions.

Several works in recent years have addressed the problem of 3D shape completion using learning-based methods. These methods can be classified by the data format of the 3D scans. The most popular data formats include volumetric grids~\citep{song2017semantic, dai2018scancomplete},
meshes~\citep{Groueix_2018_CVPR, wei2021deep}, point clouds~\citep{chang2015shapenet, dai2017scannet} and implicit representations~\citep{park2019deepsdf, erler2020points2surf, chibane20ifnet}.
%
Among them, the implicit 3D reconstruction frameworks such as DeepSDF~\citep{park2019deepsdf}, IF-Net~\citep{chibane20ifnet} and Points2Surf~\citep{erler2020points2surf} provide high resolution 3D shape-completion by estimating the implicit values for random 3D query points. To estimate the implicit values, point-wise feature extractors such as PointNet features~\citep{qi2017pointnet}
 and PointNet++ features~\citep{qi2017pointnet++} are commonly used.

% \begin{figure}[!t]
% \centering
% \includegraphics[width=0.6\linewidth]{images/teaser.pdf}
% \caption{Given a partial scan of an eye in (a),
% our semantic eye completion in (b) reconstructs the surface with finer details where each point is semantically labelled. 
% %   We then mask out the region behind the surface to capture the region of interest in (c).
% %   we complete it with fine details and segmentations in semantic implicit filed (b), which could be further processed as a semantically completed eye surface (c).
% % \djt{While (b) is important for the methodology, it is confusing in the teaser. Also, empty in point cloud is not important (used for voxels), right? \yida{yes empty is not important}}
% }
% \label{fig:teaser}
% \end{figure}



\begin{wrapfigure}{l}{0.55\linewidth}
\centering
\vspace{-5pt}
\includegraphics[width=\linewidth]{images/etra_gazing.pdf}
\vspace{-15pt}
\caption{Nine gaze directions.}\label{fig:gazings}
\vspace{-5pt}
\end{wrapfigure} 

To avoid the inefficiency caused by $k$-nearest neighbour search in the PointNet++ feature space, this paper adopts the SoftPool~\citep{wang_softpool} feature as a local descriptor to construct an end-to-end model, called \emph{SecNet}, that estimates the implicit code for each given 3D query point. With the additional semantic information used during training, our model performs semantic completion in an implicit field of the eye region, precisely completing the eye surface geometry together with semantics including the sclera, iris and pupil.
 
Training SecNet requires paired 2D partial scans and 3D ground truth. We create our dataset by synthesizing eye scans using the UnityEyes~\citep{wood2016learning} simulator. We render the 3D eye scans of 1,000 distinct people, each fixating on nine different gaze points as shown in \figref{fig:gazings}. 

Empirically, our proposed semantic implicit completion model is validated on this eye-region dataset, achieving state-of-the-art performance in reconstructing semantic geometry. Moreover, we demonstrate empirically that the accurate reconstruction of the completed eye region is helpful for gaze estimation. %This application is important in perceiving the user's current interest towards the environment for safety concern like driver action inspection~\citep{bar2012driver} and AR/VR applications.




\section{Related works}

This section focuses on the more general related work on 3D completion and semantic completion. In addition, since we propose to use semantic eye completion for gaze estimation, we also discuss related works that rely on depth images to estimate the gaze direction. 


\subsection{3D completion}

There are three different ways of completing a shape from a partial scan as shown in \figref{fig:qualitatives}: volumetric grid, point cloud and implicit surface. Early works using deep learning relied on volumetric reconstruction because of its similarity to images, which allowed them to extend the convolution operation to 3D. For instance, 3D-EPN~\citep{dai2017shape} takes TSDF volumes~\citep{werner2014truncated} as input and builds an encoder-decoder structure using 3D convolution. SSCNet~\citep{song2017semantic} proposed to use flipped TSDF as input to perform semantic segmentation and completion at the same time. 
To address the lack of 3D annotations, ForkNet~\citep{wang2019forknet} proposed to use a discriminator to synthetically generate new pairs of a partial scan and its corresponding completed reconstruction.
The main issue in such approaches is that storing 3D data in a dense volumetric grid~\citep{song2017semantic, dai2018scancomplete} consumes too much disk space and slows down inference speed for down-stream applications~\citep{dai2018scancomplete}.

\begin{figure*}[!t]
\centering
\includegraphics[width=1.0\linewidth]{images/IROS_qualitatives.pdf}
\caption{Given the input partial scan in (a) and the ground truth in (h), we compare different representations for semantic eye completion such as volumetric data (b, c), point cloud (d, e) and implicit surface (f, g). 
Note that, except for (c) and (g), which directly infer the semantic completion, the outputs of the approaches are segmented by 3D-GCN~\citep{lin2020convolution} to predict the semantic labels.
\label{fig:qualitatives}
}
\end{figure*}

Point clouds were the less popular choice because of their unorganized structure. Notably, unlike volumetric data, we cannot easily apply 3D convolution operations to them. 
To handle this issue, PointNet~\citep{qi2017pointnet} proposed a solution that uses max-pooling operations to make the feature permutation invariant, so that the order of the points going through the architecture does not matter. This feature was initially proposed for 3D object classification and segmentation, and was later used for point cloud completion in FoldingNet~\citep{yang2018foldingnet}, PCN~\citep{yuan2018pcn} and AtlasNet~\citep{Groueix_2018_CVPR}.
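As a brief illustration of this permutation invariance, the PointNet feature of a point set $\{x_1, \dots, x_n\}$ can be sketched as follows, where $h$ denotes a shared per-point MLP and $\gamma$ a subsequent MLP (both symbols are introduced here only for illustration):
\begin{align*}
f(\{x_1, \dots, x_n\}) &= \gamma\Big(\max_{i = 1, \dots, n} h(x_i)\Big), \\
f(\{x_{\pi(1)}, \dots, x_{\pi(n)}\}) &= f(\{x_1, \dots, x_n\})
\end{align*}
for any permutation $\pi$, since the element-wise maximum does not depend on the order of its arguments.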

The PointNet feature, however, lacks the ability to describe the local geometry in the point cloud. 
This motivated the extended version, PointNet++~\citep{qi2017pointnet++}, which uses $k$-nearest neighbor search to describe the local structure.
%
SoftPoolNet~\citep{wang_softpool}, on the other hand, is also motivated by the same objective but avoids running the time-consuming $k$-nearest neighbor search. Instead, this method proposes to use trainable parameters to sort the points through the feature dimension. 


% Given the fact that storing 3D data in a dense volumetric grid~\citep{song2017semantic, dai2018scancomplete} consumes too much disk space and slows down inference speed for down-stream applications~\citep{dai2018scancomplete}, sparse 3D data including point cloud and implicit reconstruction are more commonly used than volumetric data for 3D shape completion applications.
% One the one hand, point cloud offer the ability to change local resolution depending on the details of the structure being represented. However, point clouds suffer from their own drawbacks, including undefined local neighborhood and unorganized feature maps, making processing with 3D convolutions difficult.
% Several recent works have focused on solving the issues identified above with point cloud data processing.
% %\textcolor{red}
% One of the most common solution to above problem is based on using point cloud features extracted using PointNet ~\citep{yang2018foldingnet, Groueix_2018_CVPR, yuan2018pcn, liu2020morphing, grnet_xie, wang_softpool, wen2020pmp} based on PointNet feature lacks of local descriptions for further completion, while PointNet feature's local version PointNet++ feature~\citep{qi2017pointnet++} used in PMP-Net~\citep{wen2020pmp}, PointConv~\citep{wu2019pointconv} and PointCNN~\citep{li2018pointcnn} is inefficient in terms of the $K$-nearest neighbor search for completion~\citep{Wen_2020_CVPR, wen2020pmp}. Recently, SoftPoolNet~\citep{wang_softpool} proposes to sort points along certain feature dimension using learning based methods, which solves the efficiency problem of local descriptors. 

As we can observe in \figref{fig:qualitatives}, completion with an implicit surface generates a smoother reconstruction with significantly less noise compared to volumetric grids and point clouds.
%
Although their input is also based on point cloud features~\citep{erler2020points2surf, guerrero2018pcpnet}, implicit 3D reconstruction methods such as DeepSDF~\citep{park2019deepsdf}, IF-Net~\citep{chibane20ifnet} and Points2Surf~\citep{erler2020points2surf} create a fine-grained 3D shape by estimating an object surface that separates the inner and outer space. Such a format not only produces a smoother surface reconstruction, but also reveals more local structural details compared to traditional mesh reconstruction approaches such as screened Poisson reconstruction (SPR)~\citep{kazhdan2013screened}.

% Based on point cloud features, implicit 3D reconstruction such as DeepSDF~\citep{park2019deepsdf}, IF-Net~\citep{chibane20ifnet} and Points2Surf~\citep{erler2020points2surf} reconstruct a fine-grained 3D shape by estimating an object surface which distinguishes the inner and outer space. Such format not only produces smoother surface reconstruction, but also reveals more local structural details compared to traditional mesh reconstruction approaches such as screened poisson reconstruction (SPR)~\citep{kazhdan2013screened}.

\subsection{Semantic completion} 

% \djt{Are we focusing on segmentation here or semantic completion?} \yida{semantic completion}
While several methods focus on completion alone~\citep{dai2017shape, park2019deepsdf, dai2018scancomplete, yuan2018pcn},
%given a single partial scan in different output data formats,
%\eg implicit surface~\citep{dai2017shape, park2019deepsdf}, volumetric data~\citep{dai2018scancomplete} and point cloud~\citep{yuan2018pcn}. 
there are other methods that infer the semantic labels simultaneously with the geometric completion~\citep{song2017semantic, wang2019forknet, wang3dv}. For instance, SSCNet~\citep{song2017semantic} uses 3D dilated convolutions to build an encoder-decoder architecture that predicts the semantic and occupancy values for each voxel in a predefined 3D grid. ForkNet~\citep{wang2019forknet} proposes a decoder with three branches that can generate realistic new pairs of a partial scan and its semantic completion to train the entire network for semantic completion. 
% Notice that the output of SSCNet~\citep{song2017semantic} and ForkNet~\citep{wang2019forknet} is defined as a one-hot code, which inspired the proposed semantic implicit code proposed in our paper. 

For methods that perform completion alone~\citep{dai2017shape, wang_softpool, erler2020points2surf}, one solution to obtain semantic labels is to attach a segmentation framework, \eg PointNet~\citep{qi2017pointnet} and PointCNN~\citep{li2018pointcnn}, after the geometric completion.
% In some scenarios, semantics are also available for training. 
Recently, several features have been proposed for point cloud segmentation that exploit the local neighbourhood, such as PointNet++~\citep{qi2017pointnet++}, which focuses on extracting features from local point groups. Also using nearest neighbour search for feature extraction, KCNet~\citep{shen2018mining} further aggregates the local features to capture more complex relationships. KPConv~\citep{thomas2019kpconv} and 3D-GCN~\citep{lin2020convolution} make the kernel of the point cloud convolution deformable to better match different local geometries for segmentation. Although implicit reconstruction is already well investigated, implicit reconstruction with semantic estimation has not been explored. In this paper, we propose one of the first works on semantic implicit completion. 


\subsection{Gaze estimation}

There are different ways to estimate the gaze, especially when using RGB images~\citep{jianfeng2014eye}.
%
However, in this work, we limit the scope to the utilization of 3D data which is less investigated. 

% For instance, \citep{goni2004robust} binarize the input RGB image \djt{.......?}. 
%
%
% For the aim of benchmarking gaze estimation approaches with 2D input, SynthesEyes~\citep{wood2015rendering} synthesizes large amount of 2D images with realistic illumination for benchmarking gazing estimation approaches. 

To determine the direction of the user's gaze, one of the most important parameters is the position of the pupil, while the other is the eye's spherical center.
%
Given the 3D structure, EMGE~\citep{zhou20163d} proposes to estimate the entire spherical eyeball structure by estimating several points in the pupil, while RTGE~\citep{sun2015real} locates the pupil by fitting a circle to 2D scans, which is back-projected to 3D for the gaze estimation.
To the best of our knowledge, ours is the first approach that uses semantic eye completion to perform gaze estimation.
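For illustration, a common geometric model (stated here as an assumption, not necessarily the formulation used by the cited methods) takes the gaze direction as the unit vector from the eyeball center to the pupil center. With the hypothetical symbols $o_\text{eye}$ for the eye's spherical center and $o_\text{pupil}$ for the pupil position,
\begin{align*}
g = \frac{o_\text{pupil} - o_\text{eye}}{\lVert o_\text{pupil} - o_\text{eye} \rVert_2} .
\end{align*}
Under this model, an error in either of the two estimated positions changes the direction $g$, which is one reason why a complete and semantically labeled eye surface can be helpful.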

% If the 3D structures are given, \citep{jianfeng2014eye} proposed to compute the gaze direction by estimating the center of the iris and eyeball. 


% If the 3D structures are given~\citep{jianfeng2014eye}, Gazing direction could be estimated by estimating center of iris and eyeball. EMGE~\citep{zhou20163d} proposes to estimate the whole spherical eyeball structure by estimating several points in pupil firstly, which makes it possible to do gaze estimation from a partial scan. RTGE~\citep{sun2015real} locates the pupil by fitting circle to 2D scans which is further back-projected to 3D for gazing estimation.

% \subsection{Semantic eye completion dataset}
% \djt{why do you say ``real" here? Why didn't you add U2Eyes~\citep{porta2019u2eyes} and Rit-eyes~\citep{nair2020rit} to the first sentence?}
% Aiming at solving real applications such as gaze estimation, datasets~\citep{garbin2020dataset} and tools~\citep{jogeshwar2020analysis} are proposed to train and evaluate the end-to-end models with eye scans as input.
% \djt{Can you describe these datasets? What is the input?} \yida{sentence modified}
%
% One limitation of having real dataset is that the geometry might be with defects and occlusion which will be problematic for model optimization of gazing estimation.

With our solution, we encountered an issue in finding an appropriate dataset to train our models. 
The publicly available datasets do not provide paired 3D partial scans and their completions. Most datasets focus on RGB images, such as SynthesEyes~\citep{wood2015rendering}, which synthesizes 2D eye images with realistic illumination. 
This prompted us to build and publish a new dataset using the UnityEyes~\citep{wood2016learning} simulator engine.
% Since we found that some synthetic data such as U2Eyes~\citep{porta2019u2eyes} and Rit-eyes~\citep{nair2020rit} include complete eye regions. 
% To make our synthetic eye data more realistic, SynthesEyes~\citep{wood2015rendering} synthesizes 2D eye images with realistic illumination. 
% We noticed that these datasets do not provide the paried 3D partial scans and their completion.
%
%
Similar to ShapeNet~\citep{chang2015shapenet} for objects and ScanNet~\citep{dai2017scannet} for scenes, we then propose a method to build a dataset specializing in semantic eye completion from eye meshes such as those of UnityEyes~\citep{wood2016learning}. 

% Existed real datasets~\citep{garbin2020dataset} and tools~\citep{jogeshwar2020analysis} allow for using end-to-end models for gaze estimation applications. Works like Privacy-Aware Tracker~\citep{steil2019privacy} makes eye tracking applications more acceptable for the public by solving the privacy related issues. For the aim of training eye completion model, we notice that existed datasets like U2Eyes~\citep{porta2019u2eyes} and Rit-eyes~\citep{nair2020rit} do not provide partial scan. So that in this paper, we construct our own dataset for semantic eye completion UnityEyes~\citep{wood2016learning} simulator engine.

\section{Methodology}

The input to the framework is a partial scan captured by a depth camera pointing towards the eye. With $N_\text{scan}$ points, each with $(x, y, z)$ coordinates, we denote the partial scan as a point cloud $\mathcal{P}_\text{scan}$, which is represented as an $N_\text{scan} \times 3$ feature map.
In practice, these scans are affected by noise and self-occlusions, \eg from the eyelid and eyelashes. 
The objective is then to fix these issues and build a completed point cloud $\mathcal{P}_\text{eye}$.

% A single partial scan captured by depth camera towards the eye region could be presented as a point cloud $\mathcal{P}_\text{scan}$ with $N_\text{scan}$ points which is usually described by a feature map with the shape of $N_\text{scan} \times 3$, where each single point is presented by its $[x, y, z]$ coordinate.
% %
% The given $N_\text{scan}$ points in $\mathcal{P}_\text{scan}$ is practically not satisfying in terms of its surfaces because of self-occlusion, \textit{e.g.} occlusion of eye lashes and eye lips, a more corrected and completed point cloud $\mathcal{P}_\text{eye}$ without occluded structures is expected to get estimated by an inference model $\mathcal{G}$ to fix defects points and occluded regions $\mathcal{P}_\text{eye} = \mathcal{G}(\mathcal{P}_\text{scan})$.
% Practically, $\mathcal{G}$ can be implemented by many popular point cloud completion approaches including some efficient models such as FoldingNet~\citep{yang2018foldingnet} and PCN~\citep{yuan2018pcn}, and models with high accuracy such as PoinTr~\citep{yu2021pointr} and SnowflakeNet~\citep{xiang2021snowflakenet}.
%

Since semantic supervision is available for training, we also predict the semantic labels, which include the skin, sclera, iris and pupil. While some methods~\citep{lin2020convolution} utilize another inference model $\mathcal{S}$ such that the segmentation is predicted separately from the completion as $S_\text{eye} = \mathcal{S}(\mathcal{P}_\text{eye})$, we propose to use a single model to infer the semantic eye completion as $\mathcal{G}_\mathcal{S}(\cdot)$. This model therefore estimates the geometry and the semantics at the same time.

% If semantic supervisions are available for training including skin, sclera, iris and pupil, the semantic label could be predicted accordingly with another inference model $\mathcal{S}$ such as 3D-GCN~\citep{lin2020convolution}, which forms the estimation procedure of completion $\mathcal{P}_\text{eye} = \mathcal{G}(\mathcal{P}_\text{scan})$ and segmentation $S_\text{eye} = \mathcal{S}(\mathcal{P}_\text{eye})$. Such estimating procedure is validated as baseline performance in the experiment section. 


\subsection{Semantic implicit fields}



Moreover, we take advantage of the particular problem at hand. Notably, the similarity of the eye structures across different individuals and different movements allows us to use implicit reconstruction effectively. This overcomes the limitation of unstructured point clouds in terms of structural accuracy and the limitation of voxel grids in terms of reconstruction resolution, leading to a high-resolution reconstruction with reduced noise. 
%
This is validated in \figref{fig:qualitatives}.
In addition, we noticed that our method even produces a denser and smoother reconstruction than the ground truth.



% Because point cloud based procedures are limited in both structural accuracy and reconstruction resolutions. We propose to do semantic completion with the help of the proposed \emph{semantic implicit field} (SIF) which is demonstrated in following sections.

Inspired by the works of IF-Net~\citep{chibane20ifnet} and Points2Surf~\citep{erler2020points2surf}, we also learn implicit values between a set of query points and the mesh surface. For each query point, these methods predict a value between $-1$ and $1$, where a query point on the surface is at the zero-crossing.
The difference between traditional implicit surface learning and our model is that we propose to represent the semantic labels in addition to the geometry, which we call the \emph{semantic implicit field} (SIF). Therefore, in our work, given an arbitrary query point $p_\text{query}$, we can simplify the framework to a classification task where the architecture predicts whether the point lies in \emph{empty} space or is part of the \emph{skin}, \emph{sclera}, \emph{iris} or \emph{pupil}. We refer to the classification result as the \emph{semantic code} $c_\text{query}$ of the query point such that $c_\text{query} = \mathcal{G}_\mathcal{S}(\mathcal{P}_\text{scan}, p_\text{query})$.
%
This implies that the output eye reconstruction $\mathcal{P}_\text{eye}$ is represented by all the query points that are not empty. 
%
% \djt{Do you think it makes sense to describe how you select the query points?} \yida{added}
% In practice, assuming that the partial scan is normalized, we sample query points randomly around the partial scan within a surrounding space defined by Chamfer distance of 0.3 between query point and the input point cloud.
%
In practice, assuming that the partial scan is normalized to a unit cube, we sample the query points randomly around the partial scan within a Chamfer distance of $0.3$.
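To make the notation concrete, one way to write the semantic code is as a probability vector over the five classes. This is a sketch under the assumption that the predicted class is taken as the most likely entry; the index set $\mathcal{K}$ and the superscript notation are introduced here only for illustration:
\begin{align*}
\mathcal{K} &= \{\text{empty}, \text{skin}, \text{sclera}, \text{iris}, \text{pupil}\}, \\
c_\text{query} &= \big(c_\text{query}^{(k)}\big)_{k \in \mathcal{K}} \in [0, 1]^{5},
\qquad \sum_{k \in \mathcal{K}} c_\text{query}^{(k)} = 1, \\
\mathcal{P}_\text{eye} &= \Big\{ p_\text{query} \;:\; \arg\max_{k \in \mathcal{K}} c_\text{query}^{(k)} \neq \text{empty} \Big\}.
\end{align*}
Each retained point then carries the label $\arg\max_{k} c_\text{query}^{(k)}$, so that geometry and semantics are obtained from the same prediction.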





\begin{wraptable}{r}{0.62\linewidth}
% \begin{table}
\centering
\includegraphics[width=1.\linewidth]{images/ETRA_architecture.pdf}
\resizebox{1.\linewidth}{!}
{
\begin{tabular}{r|l}
% \multicolumn{2}{c}{SecNet \small{(Transformer)}} 
\toprule	
\textbf{Modules} & \textbf{Parameters} \\
\midrule
MLP (encoder) & $D_\text{out} = [512, 512, 8]$  \\
\midrule
Soft Pooling & $N_r = 32$, $N_f = 8$  \\
\midrule
Regional Convolution & $N_p = 8$, $D_\text{out} = 8$, $D_\text{kernel} = [8, 8]$ \\
\midrule
2D Convolution & $D_\text{out} = 64$, $D_\text{kernel} = [256, 8]$ \\
\midrule
Positional Coding & $D_\text{out} = 64$ \\
\midrule
MLP (decoder) & $D_\text{out} = [16, 32, 64, 5]$ \\
\bottomrule
\end{tabular}
}

\caption{Architecture of SecNet with the corresponding hyperparameters for each module where $D_\text{out}$ represents the output dimensions.
\label{tab:architecture}}
%\end{table}
\end{wraptable}


\subsection{SecNet architecture}

The architecture for $\mathcal{G}_\mathcal{S}(\cdot)$ is summarized in \tabref{tab:architecture}, where we build an encoder-decoder structure.
Here, the encoder processes the partial scan $\mathcal{P}_\text{scan}$ and produces the latent feature that describes the global structure. Given the latent feature and a query point $p_\text{query}$, the decoder runs the implicit estimation that finds the semantic code $c_\text{query}$, which classifies whether the point is empty or belongs to a specific part of the eye.

In particular, the encoder first randomly sub-samples $\mathcal{P}_\text{scan}$ to $N_\text{in}=2048$ points to obtain an input tensor of fixed size. These points are fed to a 3-layer MLP that generates an output dimension of 8. 
%
We then use the SoftPool~\citep{wang_softpool} operation with the number of regions $N_f=8$ and the number of regional points $N_r=32$, which produces an 8-region feature map with the shape of $[256, 8]$.
%
This is processed by a regional convolution~\citep{wang_softpool} with a kernel size $D_\text{kernel} = [N_p, N_f]$ for all 8 regions, which covers $N_p = 8$ points with zero padding in each region. We then add a 2D convolution with kernel size $D_\text{kernel} = [N_r \times N_f, N_f]$, resulting in a 64-dimensional vector as the latent feature. 
% \yida{Modified} 
%
Since the encoder is only dependent on the partial scan, the latent feature is constant across all the query points in the decoder. 
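The tensor shapes through the encoder can be traced as follows. This is a reading of the hyperparameters in \tabref{tab:architecture}, and the shape after the regional convolution assumes that the zero padding preserves the number of points:
\begin{align*}
[N_\text{in}, 3] = [2048, 3]
&\xrightarrow{\text{MLP}} [2048, 8] \\
&\xrightarrow{\text{SoftPool}} [N_r \times N_f, 8] = [256, 8] \\
&\xrightarrow{\text{regional conv.}} [256, 8] \\
&\xrightarrow{\text{2D conv.}} [64].
\end{align*}
Under this reading, the last step follows because the kernel $[N_r \times N_f, N_f] = [256, 8]$ spans the whole feature map, so each of the $D_\text{out} = 64$ filters yields a single value.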
%


% To estimate each semantic code regarding to each single query point $p_\text{query}$, we construct model $\mathcal{G}_{\text{S}}$ concerning to the whole partial scan $\mathcal{P}_\text{scan}$. 
% $\mathcal{P}_\text{scan}$ is first resampled to 2,048 points.
% Then they are fed into a 3-layer MLP that generates the an output neuron with a dimension 8. 
% %
% To present both global feature and local feature of the whole partial scan $\mathcal{P}_\text{scan}$, we then perform soft pooling~\citep{wang_softpool} which produces a feature map with the shape of $[256, 8]$ by setting number of regional points $N_r$ to 32. Next, such feature is processed by a 2D convolution~\citep{wang_softpool}, resulting in a 64-dimension feature vector to describe $\mathcal{P}_\text{scan}$.


Every query point in the decoder is converted to a positional code using SIREN~\citep{sitzmann2019siren}, having the same dimension as the latent feature.
%
Thereafter, the sum of the positional coding and the latent feature serves as the input to the 4-layer MLP to estimate the final semantic code $c_\text{query}$ with a softmax activation.
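In compact form, the decoder can be sketched as follows, where $z \in \mathbb{R}^{64}$ denotes the latent feature from the encoder and $\gamma(\cdot)$ the positional code; both symbols are introduced here only for illustration:
\begin{align*}
c_\text{query} = \operatorname{softmax}\Big(\operatorname{MLP}\big(z + \gamma(p_\text{query})\big)\Big),
\qquad \gamma : \mathbb{R}^{3} \rightarrow \mathbb{R}^{64}.
\end{align*}
Because $z$ is computed once per partial scan, evaluating additional query points only requires re-running $\gamma$ and the MLP.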

% Such feature is processed with regional convolution summed up with a 64-dimension positional coding~\citep{sitzmann2019siren} of a query point $p_\text{query}$ which is fed towards remaining network architecture to estimate implicit code. The parameter initialization is set to be the same as described in SIREN~\citep{sitzmann2019siren}.

% Finally, for the decoder, we use a 4-layer MLP to get the final implicit code $c_\text{query}$. The detailed network parameters are listed in~\tabref{tab:architecture}.


We summarize the numerical values of our architecture in \tabref{tab:architecture}. This table shows the architecture on top and the corresponding parameters for each layer at the bottom.


To train the proposed encoder-decoder model, we impose the per-category binary cross entropy $\epsilon_c(\cdot,\cdot)$ such that 
%
\begin{align}
        \mathcal{L}_{\text{semantic}} 
        = \sum_{c} 
        \epsilon_c(c_\text{query}, c_\text{gt}) 
        %  \nonumber \\
        = \sum_{c} 
        \epsilon_c\left(\mathcal{G}_{\text{S}}(\mathcal{P}_\text{scan}, p_\text{query}), c_\text{gt}\right)
        \label{eq:l_implicit_field}
\end{align}
%
sums up the entropy for all categories.
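%
Assuming the standard form of the binary cross entropy, each per-category term can be written as
\begin{align*}
        \epsilon_c(q, r) = -r \log q - (1 - r)\log(1 - q) ~,
\end{align*}
where $q \in (0, 1)$ is the predicted score of category $c$ in $c_\text{query}$ and $r \in \{0, 1\}$ is the corresponding entry of $c_\text{gt}$; the symbols $q$ and $r$ are introduced here only for illustration.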
%
Given this loss function, we train the model $\mathcal{G}_{\text{S}}$ with a batch size of 64. We employ the Adam optimizer~\citep{kingma2014adam} with a learning rate of 0.0001 while the exponential decay rates $\beta_1$ and $\beta_2$ are set to 0.9 and 0.999, respectively.


% \subsection{Semantic implicit eye}

% Instead of infer geometric completion and semantic segmentation in a cascaded way with point cloud approaches, we do semantic eye completion with a single model $\mathcal{G}_{\text{S}}(\mathcal{P}_\text{scan})$ which estimates the completed geometry and semantics at the same time. 
% Instead of produce expected 3D geometry $\mathcal{P}_\text{eye}$ directly from $\mathcal{P}_\text{scan}$ via popular point cloud encoder-decoder architectures, we investigate a set of query points $\mathcal{P}_\text{query}$ around $\mathcal{P}_\text{scan}$ to check whether each single point $p_\text{query}^i \in \mathcal{P}_\text{query}$ should be used to present the expected shape.

% The way we check every query points is referring to their semantic implicit code $\mathcal{C}_\text{query}$, which formulates geometry $\mathcal{P}_\text{query}$ and semantics $S_\text{query}$ as a softmax code with feature dimensionality of $N_\text{cat}+1$, where $N_\text{cat}$ represent the number of semantic categories. More specifically for eye reconstruction task, skin, sclera, iris and pupil altogether with empty space form a 5-dimensional softmax code $c_\text{query}^i$ for each query point, where the first digit is in charge of presenting the empty space. 

% %there will be a set of softmax codes $\mathcal{C}_\text{query} = \{\mathcal{C}_\text{eye}, \mathcal{C}_\text{empty}\}$ getting predicted accordingly, where $\mathcal{C}_\text{eye}$ represent the semantic surface. 
% %\begin{align}
% %    \mathcal{C}_\text{eye} = [\mathcal{P}_\text{eye}, S_\text{eye}] = %\mathcal{G}_{\text{S}}(\mathcal{P}_\text{scan}) ~,
% %\end{align}

% Inspired by works such as IF-Net~\citep{chibane20ifnet} and Points2Surf~\citep{erler2020points2surf} which learn implicit values between query points and mesh surfaces with range between -1 and 1, our proposed model also predicts implicit values in $c_\text{query}^i$. The difference between our model and traditional implicit surface learning is that our proposed \emph{semantic implicit field} (SIF) also presents semantic labels. 
% The values of each single dimension of code $c_\text{query}^i$ ranges between 0 and 1 which reveals the confidence score of point $p_\text{query}^i$ belonging to certain categories.
% %
% By sampling enough number of points $\mathcal{P}_\text{query}$, a completed eye could be defined by subset $\{p_\text{eye}\} \in \mathcal{P}_\text{query}$.
% %
% The loss function used to optimize model $\mathcal{G}_{\text{S}}$ for points in semantic implicit field (SIF) is then defined as a binary cross entropy
% \begin{align}
%         \mathcal{L}_{\text{semantic}} &= \sum_{i=1}^{N_\text{cat}+1}\epsilon(c_\text{query}, c_\text{gt}) \nonumber \\
%         &= \sum_{i=1}^{N_\text{cat}+1}\epsilon(\mathcal{G}_{\text{S}}(p_\text{query}), c_\text{gt})
%         \label{eq:l_implicit_field}
% \end{align}
% %
% to train the inference network,
% where $\epsilon(\cdot,\cdot)$ is the per-category error
% %
% \begin{align}
%         \epsilon(q,r) = -r \log q - (1 - r)\log(1 - q)~.
%         \label{eq:per_category_error}
% \end{align}

% So that eventually the eye region would be presented as
% \begin{align}
%         \mathcal{P}_\text{eye} = \{p_\text{query}^i\}_{i=1}^{N_\text{query}} ~~ \; \text{if} ~~\;  \text{argmax}(c_\text{query}^i) \neq 1~,
%         \label{eq:eye_region}
% \end{align}
% where $\text{argmax}$ returns the index of the dimensionality with the largest value.



% \subsection{Implicit eye surface}

% \begin{figure*}[!t]
% \centering
% \includegraphics[width=1.0\linewidth]{images/ETRA_masking.pdf}
% \caption{Masked semantic eye completion.}
% \label{fig:masking}
% \end{figure*}

% As shown in ~\figref{fig:masking} (a), the reconstructed semantic completion of eye region is with too much false positive skin queries behind the facial surface. To have a further analysis between empty space and the eye surface, we present the geometric implicitly with a code with 2 digits $w_1$ and $w_2$ where
% \begin{align}
%         w_1 &= \frac{c_\text{query}^1}{c_\text{query}^1 + \frac{1}{D} \sum^{D+1}_{i=2} c_\text{query}^i} \nonumber \\
%         w_2 & = \frac{\frac{1}{D} \sum^{D+1}_{i=2} c_\text{query}^i}{c_\text{query}^1 + \frac{1}{D} \sum^{D+1}_{i=2} c_\text{query}^i} = 1 - w_1 ~,
%         \label{eq:implicit}
% \end{align}
% so that $w_1$ and $w_2$ sum up to 1. 
% By learning the implicit digit $w$ with the Euclidean distance to regress the ground truth implicit value of query point
% \begin{align}
%         \mathcal{L}_{\text{surface}} &= \left\|(w_1 - \tanh(d_\text{surf}(p_\text{query}))\right\|^2_2 ~,
%         \label{eq:l_implicit_surface}
% \end{align}
% where $d_\text{surf}(p_\text{query})$ is the distance of query point $p_\text{query}$ to the ground-truth mesh surface. Our model is trained with both (\ref{eq:l_implicit_field}) and (\ref{eq:l_implicit_surface}) with equal weights.

% Eventually we weight the semantic implicit code $c_\text{query}$ by $w$, where empty dimension $c_\text{query}^1$ is weighted by $w_1$ and all remaining semantics dimensions are weighted by $w_2$. Practically, we implement such weighting procedure with a weighting vector $\mathcal{W}$ which is shown in \figref{fig:masking} (b) with the same number of elements in $C_\text{query}$ where all digits of a single element are equal to $w_2$ except for the first digit, which is $w_1$. \figref{fig:masking} (d) demonstrates the weighted final result of the semantically reconstructed eye surface $\mathcal{P}_\text{eye-surf}$
% \begin{align}
%         \mathcal{P}_\text{eye-surf} = \{p_\text{query}^i\}_{i=1}^{N_\text{query}} ~~\; \text{if} ~~ \; \text{argmax}(w^i \times c_\text{query}^i) \neq 1 ~.
%         \label{eq:eye_surface}
% \end{align}



% \begin{figure}[!b]
% \centering
% \includegraphics[width=0.6\linewidth]{images/etra_gazing.pdf}
% \caption{Nine gaze directions in the Sec-Eye Dataset}
% \label{fig:gazings}
% \end{figure}



\subsection{Gaze estimation}
\label{sec:gaze}

As a by-product of our semantic eye completion, we estimate the gaze direction through the semantic points.
Similar to RTGE~\citep{sun2015real}, we solve this problem by estimating a 3D vector from the center of the eyeball to the center of the pupil. 
% 
To find the centers, we use all the points on the sclera to fit a sphere that represents the eyeball; we then take the average of all the points on the iris.
The gaze direction is finally estimated as the vector that connects the center of the sphere and the average point.
%
For the sphere, we use an energy optimization to estimate its center $p_\text{center}= (x_c, y_c, z_c)$ as well as its radius $r$ by minimizing the loss
%
\begin{align}
        \mathcal{L}_{\text{eyeball}} = \sum_{i=1}^{N_\text{sclera}} \left| \left\| 
        p^i_\text{sclera} - p_\text{center}
        \right\|^2 - r^2 \right| ~,
        \label{eq:eyeball}
\end{align}
% \begin{align}
%         \mathcal{L}_{\text{eyeball}} = \sum_{i}^{N_\text{sclera}} \left( \left\|
%         p^i_\text{sclera} - p_\text{center}
%         \right\| - r \right)^2 ~,
% \end{align}
%
summing up the absolute errors from all the $N_\text{sclera}$ points labelled as sclera $p^i_\text{sclera}$ in the semantic eye completion.
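%
As an illustration of the final step, let $\bar{p}_\text{iris}$ denote the average of the $N_\text{iris}$ points labelled as iris and $g$ the resulting gaze direction (the symbols $\bar{p}_\text{iris}$, $N_\text{iris}$, $p^i_\text{iris}$ and $g$ are introduced here and are not part of the notation above). The gaze can then be sketched as
\begin{align*}
        \bar{p}_\text{iris} = \frac{1}{N_\text{iris}} \sum_{i=1}^{N_\text{iris}} p^i_\text{iris} ~, \qquad
        g = \frac{\bar{p}_\text{iris} - p_\text{center}}{\left\| \bar{p}_\text{iris} - p_\text{center} \right\|} ~,
\end{align*}
where the normalization to unit length is an assumption made for convenience.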
%


% There are several ways to estimate the sphere such as a linear system or an energy optimization. 
% Since we have the ground truth eyeball centers from the training dataset, we implemented a single-layer linear projection to find the sphere's center $p_\text{center}= [x_c, y_c, z_c]^2$ and radius $r$.
% Similar to $\mathcal{P}_\text{scan}$, the input to this sub-architecture is the set of points on the sclera that are randomly sampled to $N_\text{sphere}=256$ points $\mathcal{P}_\text{sclera}$.
% The model is then trained with the MSE loss
% %
% % \begin{align}
% %         \mathcal{L}_{\text{eyeball}} = \sum_{i = 1}^{N_\text{sphere}} \| (p_\text{sclera}^i[1] - x)^2 + (p_\text{sclera}^i[2] - y)^2 + (p_\text{sclera}^i[3] - z)^2 - r^2\|^2 
% % \end{align}
% \begin{align}
%         \mathcal{L}_{\text{eyeball}} = 
%         \left\| 
%         p_\text{center} - p_\text{gt-center}
%         \right\|^2
% \end{align}
% %
% where $p_\text{center}$ and $p_\text{gt-center}$ are the estimated and ground truth eyeball centers, respectively.
% %
% In this way, the model is trained to adapt in cases when the reconstruction is not perfectly spherical.

% To find the eyeball center, we use all the points on the sclera to fit a sphere that represents the eyeball; We follow a point cloud registration pipeline using both global and local registration for efficiency and accuracy concern. First, a 33-dimensional point-wise FPFH~\citep{rusu2009fast} feature is calculated for RANSAC~\citep{choi2015robust} global registration. Then the ICP~\citep{} local registration is applied for a more accurate registration.


% Then, take the average point of all the points on the iris.
%
% \djt{How to find the sphere? We represent the sphere as ...}
% %
% The gaze direction is finally estimated as the vector that connects the center of the sphere and the average point.




% RTGE~\citep{sun2015real} defines the task of gazing estimation in terms of estimating center of pupil and eyeball. In this paper, the center of iris is estimated by taking the average 3D coordinates of all points on iris and pupil. Then the center of eyeball is estimated by fitting a 3D spherical ball to the sclera by 3D translation, where the center of such 3D spherical ball is used as the estimated center of eyeball. The gazing direction is then defined as the direction from the center of eyeball to center of iris.

% \djt{Can you describe the gaze estimation here?}

% \yida{added optimizer}
% \subsection{Network optimization}
% \label{sec:optimization}



% \begin{wrapfigure}{r}{0.45\linewidth}
% \centering
% \includegraphics[width=\linewidth]{images/etra_gazing.pdf}
% \caption{Nine gaze directions in the Sec-Eye Dataset.}\label{fig:gazings}
% \end{wrapfigure} 




% \begin{figure}
% \centering
% \begin{minipage}{.5\textwidth}
%   \centering
%   \includegraphics[width=0.95\linewidth]{images/etra_gazing.pdf}
%   \vspace{10pt}
%   \captionof{figure}{Nine gaze directions.}
%   \label{fig:test1}
% \end{minipage}%
% \begin{minipage}{.5\textwidth}
%   \centering
%   \includegraphics[width=\linewidth]{images/IROS_dataset.pdf}
%   \vspace{10pt}
%   \captionof{figure}{Examples of the Sec-Eye Dataset.}
%   \label{fig:dataset}
% \end{minipage}
% \end{figure}


% \section{\yida{Data augmentation}} 

\section{Dataset} 
\label{sec:dataset}

% \begin{figure}[!t]
% \centering
% \includegraphics[width=0.6\linewidth]{images/IROS_dataset.pdf}
% \caption{Some examples in the Sec-Eye Dataset.}
% \label{fig:dataset}
% \end{figure}


\begin{wrapfigure}{r}{0.55\linewidth}
\centering
\vspace{-10pt}
\includegraphics[width=\linewidth]{images/IROS_dataset.pdf}
\vspace{-15pt}
\caption{Examples of our dataset.}\label{fig:dataset}
\vspace{-5pt}
\end{wrapfigure} 


% \djt{Since the dataset is a contribution, there must be a separate section for that.}

We generate the dataset by rendering pairs of partial scans and their corresponding semantic completion using the UnityEyes~\citep{wood2016learning} mesh models.
%
To generalize for the gaze estimation, we rotate the eyeball towards nine gaze directions as shown in \figref{fig:gazings}, including up-right, up, up-left, right, straight, left, down-right, down and down-left. 
%
With 1,000 identities from UnityEyes, we then have a total of 9,000 pairs in the dataset.
%
For the experiments, we split our dataset with 800 identities for training and 200 for testing.
%
One of the main advantages of using depth images or partial scans as input is that privacy is preserved during both training and inference.


\figref{fig:dataset} shows some examples of the process that each mesh model undergoes when creating the dataset. Since the dataset is synthetically rendered, we impose defects and self-occlusion by randomly dropping 61.2\% of the points on the pupil, 74.9\% on the iris, 29.7\% on the sclera and 9.0\% on the skin of every mesh model, as shown in \figref{fig:dataset}(b). 
In addition, we add jitter to the surface in order to mimic the sensor noise. The jitter follows a Gaussian distribution with zero mean and a standard deviation of 0.05 for points on the sclera, iris and pupil. 
\figref{fig:dataset}(c) illustrates example input partial scans that we use for the evaluation.
Notably, without visualizing the ground truth semantic labels in \figref{fig:dataset}(c), identifying the regions on the eye or the gaze direction from the three images becomes very difficult. 
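Regarding the jitter described above, and under the assumption that the noise is sampled independently for each coordinate, the perturbation of a point $p$ on the sclera, iris or pupil can be sketched as
\begin{align*}
        \tilde{p} = p + n ~, \qquad n \sim \mathcal{N}\left(0, \sigma^2 I\right) ~, \qquad \sigma = 0.05 ~,
\end{align*}
where $\tilde{p}$, $n$ and $\sigma$ are symbols introduced here only for illustration.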



% % To find a way to model defects and occlusions for real eye scans, we first rotate and translate synthetic UnityEyes eye models to get it aligned with 5 real eye scans collected with a multi-camera rig using efficient ICP~\citep{rusinkiewicz2001efficient} registration. 
% To model the defects and occlusions, we randomly drop 61.2\% points for pupil, 74.9\% of iris, 29.7\% of sclera and 9.0\% of skin on every mesh models. The partial scans are made by randomly dropping out certain percentage of points according to different semantics from the complete mesh models as shown in ~\figref{fig:dataset}. To test the noise tolerance capacity of our model, we add jitter effect on the surface as shown in ~\figref{fig:dataset} (b), where a Gaussian distribution zero mean and standard deviation of 0.05 for points on category of sclera, iris and pupil 
%

% To make the training data rich in terms of gaze directions, we render 3D meshes by moving the eyeball towards 9 general gaze directions as shown in \figref{fig:gazings} for each identity including up-right, up, up-left, right, straight, left, down-right, down and down-left. In total, for 1000 identities we synthesize up to 9000 synthetic eye meshes to generate the paired partial scan -- semantic implicit completion supervision dataset. For our experiments, we use 800 identities for training and 200 identities for testing.

% \djt{How do you change the identity if they are synthetic?}

% \djt{How can we justify that the noise level are similar to the real depth data?}


\section{Experiments}

\begin{table}[!t]
\centering
%\resizebox{\linewidth}{!}
%{
\begin{tabular}{c|l|cccc|c}
% \multicolumn{7}{l}{Average L1 metric across 16,384 points} \\
\toprule	
 \multicolumn{2}{c}{\small{Method}} 
 & \small{skin} & \small{sclera} & \small{iris} & \small{pupil} & \small{\emph{avg.}} \\
\midrule 
       \multirow{2}{*}{\rotatebox[origin=c]{90}{\centering \emph{\small{Voxel}}}} & 3D-EPN~\citep{dai2017shape} & 13.43 & 22.09 & 19.43 & 15.96 & 17.73 \\
       & ForkNet~\citep{wang2019forknet} & 17.04 & 14.75 & 18.16 & 14.78 & 16.18 \\
\midrule 
       \multirow{11}{*}{\rotatebox[origin=c]{90}{\centering \emph{\small{Point Cloud}}}} 
       & PointNet++~\citep{qi2017pointnet++} & 9.72 & 10.24 & 12.73 & 11.85 & 11.13 \\
       & FoldingNet~\citep{yang2018foldingnet} & 9.35 & 10.23 & 12.29 & 11.68 & 10.89 \\
       & TopNet~\citep{tchapmi2019topnet} & 8.82 & 10.22 & 11.82 & 11.02 & 10.48 \\ 
       & AtlasNet~\citep{Groueix_2018_CVPR} & 8.15 & 9.70 & 11.18 & 10.64 & 9.92 \\
       & PCN~\citep{yuan2018pcn} & 7.48 & 9.69 & 11.12 & 10.29 & 9.65 \\
       & MSN~\citep{liu2020morphing} & 7.01 & 9.10 & 10.63 & 9.77 & 9.13 \\
       & SoftPoolNet~\citep{wang_softpool} & 6.54 & 8.78 & 9.79 & 9.46 & 8.65 \\
       & GRNet~\citep{grnet_xie} & 6.22 & 8.70 & 9.68 & 9.14 & 8.44 \\
       & PMP-Net~\citep{wen2020pmp} & 5.74 & 8.26 & 8.98 & 8.83 & 7.96 \\
       & CRN~\citep{Wang_2020_CVPR} & 5.57 & 8.23 & 8.98 & 8.81 & 7.90 \\
       & SnowflakeNet~\citep{xiang2021snowflakenet} & 4.93 & 7.48 & 8.76 & 8.68 & 7.46 \\
       \midrule
       \multirow{3}{*}{\rotatebox[origin=c]{90}{\centering \emph{\small{Implicit}}}} & IF-Net~\citep{chibane20ifnet} & 5.43 & 7.98 & 7.95 & 7.09 & 7.12 \\
       & Points2Surf~\citep{erler2020points2surf} & 4.93 & 7.45 & 7.48 & 6.91 & 6.70 \\
       & \textbf{SecNet} & \textbf{4.21} & \textbf{6.99} & \textbf{7.17} & \textbf{6.25} & \textbf{6.15} \\
\bottomrule
\end{tabular}
%}

\caption{Evaluation of the semantic eye completion. 
% The Eye completion is reported with Chamfer distance trained with L1 distance (multiplied by $10^3$) with the 16,384 points in each semantic completed eye, where \emph{avg.} means the averaged Chamfer distance across 4 categories. 
We measure the L1-Chamfer distance (multiplied by $10^3$) for each category and compute the average across all categories.
% \djt{update captions}
 \label{tab:complete}
}
%\vspace{10pt}
\end{table}

To highlight our contributions in this work, we conduct the following 3 experiments: synthetic eye augmentation, semantic eye completion and gaze estimation. 
% We empirically demonstrate the advantages of our approaches with a dataset called Sec-Eye (semantic completion eyes) which is rendered with meshes from UnityEyes~\citep{wood2016learning}.
We empirically demonstrate the advantages of our approaches using the dataset from \secref{sec:dataset}.
%

\subsection{Semantic eye completion}

We evaluate the semantic completion of the eye region on the dataset rendered from UnityEyes~\citep{wood2016learning} models. The point clouds of the partial scans serve as the input for both training and testing. Their corresponding ground truth completed shapes are represented as point clouds with semantic labels.
%
Our evaluations are carried out across 4 categories including skin, sclera, iris and pupil.
%
Note that, across all methods, the input partial scan is normalized to the same scale, with point coordinates ranging between $-0.5$ and $0.5$.


This evaluation compares against the state-of-the-art completion methods that reconstruct using volumetric data, point cloud or implicit surface.
%
While \figref{fig:qualitatives} shows the qualitative results of different approaches, \tabref{tab:complete} highlights the numerical comparison between them. This table shows that we achieve the state-of-the-art results, reaching an average L1-Chamfer distance of $6.15\times10^{-3}$ on all the categories. It also shows that we have the best results across all the four categories.
It is worth noting that only ForkNet~\citep{wang2019forknet} and our approach perform semantic completion, while the other methods can only infer the geometric completion. For these other methods, we therefore apply 3D-GCN~\citep{lin2020convolution} on the reconstruction to find the semantic labels.
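For reference, one common form of the L1-Chamfer distance between a reconstructed point set $\mathcal{P}$ and a ground-truth point set $\mathcal{Q}$ is sketched below. The symbols $\mathcal{P}$, $\mathcal{Q}$ and $d_\text{CD}$ are introduced only for this illustration, and the use of the unsquared Euclidean norm as well as the exact normalization are assumptions, since conventions differ between implementations:
\begin{align*}
        d_\text{CD}(\mathcal{P}, \mathcal{Q}) = \frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \min_{q \in \mathcal{Q}} \left\| p - q \right\| + \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \min_{p \in \mathcal{P}} \left\| q - p \right\| ~.
\end{align*}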

% Evaluation based on 4,096 points in ground truth presented in \tabref{tab:complete} and \figref{fig:qualitatives} show that we have achieved the state of the art results reaching $6.15\times10^{-3}$ L1 Chamfer distance on average of 4 categories. 
% It is noteworthy that we use 3D-GCN~\citep{lin2020convolution} to segment the geometric completion result of all reported appraoches except for our proposed approach and ForkNet~\citep{wang2019forknet}, so that categorical geometric evaluation could be done. In terms of the average completion performance, our approach performs the best for each single categories in where skin is the easiest category with $4.21\times10^{-3}$ L1 Chamfer distance. 

% We compare against major point cloud completion approaches and a selection of volumetric and implicit reconstruction approaches. %such as 
% PCN~\citep{yuan2018pcn}, 
% FoldingNet~\citep{yang2018foldingnet}, 
% AtlasNet~\citep{Groueix_2018_CVPR}, 
% PointNet++~\citep{qi2017pointnet++}, SoftPoolNet~\citep{wang_softpool},
% MSN~\citep{liu2020morphing} and 
% GRNet~\citep{grnet_xie}. 


Looking closely at the comparison against the volumetric methods, their errors are significantly higher, \eg ForkNet~\citep{wang2019forknet} has an average Chamfer distance of $16.18 \times 10^{-3}$, since they only use a grid with an output dimension of $[64, 64, 64]$. To evaluate with the same metric, we convert the grid into a point cloud before computing the Chamfer distance. From \figref{fig:qualitatives}, we can observe that they have an obvious disadvantage due to their low resolution.


% To briefly show how the point cloud approaches differ from the volumetric, we also evaluate 3D-EPN~\citep{dai2017shape} and 
% ForkNet~\citep{wang2019forknet} and convert the $64 \times 64 \times 64$ voxel grid to point cloud in \tabref{tab:complete} by positioning points on every occupied voxels. 
% %
% % The volumetric approaches like 3D-EPN~\citep{dai2017shape} and ForkNet~\citep{wang2019forknet} are also reported  with points mapped from a fixed $64 \times 64 \times 64$ voxel grid. 
% %
% Due to their low resolution as shown in~\figref{fig:qualitatives}, they have an obvious disadvantage when they are evaluated with the Chamfer distance in ~\tabref{tab:complete}, where ForkNet is with an average Chamfer distance of $16.18 \times 10^{-3}$.
%

As for point cloud approaches, we compare our semantic eye completion results with some recently proposed methods such as FoldingNet~\citep{yang2018foldingnet}, PCN~\citep{yuan2018pcn}, MSN~\citep{liu2020morphing}, GRNet~\citep{grnet_xie} and SoftPoolNet~\citep{wang_softpool}.
%
All these approaches reconstruct a point cloud, which is further re-sampled to 4,096 points for evaluation. Compared to SnowflakeNet~\citep{xiang2021snowflakenet}, which is the state-of-the-art point cloud completion approach, our method decreases the average Chamfer distance by $1.31 \times 10^{-3}$. 

Lastly, when we compare against the other implicit methods, Points2Surf~\citep{erler2020points2surf} also performs well with a Chamfer distance of $6.70 \times 10^{-3}$. However, our proposed architecture performs the best among all listed approaches in \tabref{tab:complete} with an error of $6.15 \times 10^{-3}$.

% Similar to our proposed SecNet which estimate implicit value for each query point, Points2Surf~\citep{erler2020points2surf} also performs well for Chamfer distance of $6.70 \times 10^{-3}$. Our proposed architecture performs the best among all listed approaches in~\tabref{tab:complete} which is $6.15 \times 10^{-3}$ for Chamfer distance.

% All other approaches are visualized with the same 16,384 points except for FoldingNet with 2,048 points and MSN with 8,192 points. Our approach shows advantages in both smoothness and completion precision. 

% As shown in~\figref{fig:teaser}, the pupil and iris are both precisely reconstructed from noisy input where the defects on pupil is learned as a guide to present the gaze direction.


\subsection{Gaze estimation}



\begin{table}[!t]
\centering
% \resizebox{\linewidth}{!}
{
\begin{tabular}{c|c|l|cc|c|c}
% \multicolumn{5}{l}{Gaze Estimation} \\
\toprule	
 \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{} 
 & \emph{\small } & \multicolumn{1}{c}{\emph{\small Cosine }} & \multicolumn{1}{c}{\emph{\small Model Size}} & \multicolumn{1}{c}{\emph{\small Time}} \\
 \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{\small Method} 
 & \emph{\small Accuracy} & \multicolumn{1}{c}{\emph{\small Distance}} & \multicolumn{1}{c}{\emph{\small   (MB)}} & \multicolumn{1}{c}{\emph{\small  (seconds)}} \\
\midrule 
       \multicolumn{2}{c|}{\multirowcell{3}{\emph{\small{Direct}} \\\emph{\small{Gaze}}\\ \emph{\small{Estimation}}}} 
       & EMGE~\citep{zhou20163d} & 42.1\% & 0.637 & -- & 0.14 \\
       \multicolumn{2}{c|}{} 
       & RTGE~\citep{sun2015real} & 56.9\% & 0.691 & -- & 0.08 \\
       \multicolumn{2}{c|}{} 
       & 3D-GCN~\citep{lin2020convolution} & 61.8\% & 0.745 & 6.6 & 0.82 \\
\midrule 
	\multirow{16}{*}{\rotatebox[origin=c]{90}{\centering \emph{\small{Gaze from Semantic Eye Completion}}}}
       &\multirow{2}{*}
       {\rotatebox[origin=c]{90}{\centering \emph{\small{Voxel}}}}
       & 3D-EPN~\citep{dai2017shape} & 81.4\% & 0.802 & 420.0 & 0.82 \\
       && ForkNet~\citep{wang2019forknet} & 83.8\% & 0.809 & 362.0 & 1.12 \\
\cmidrule{2-7}	
       &\multirow{11}{*}%{\emph{p.c.}}
       {\rotatebox[origin=c]{90}{\centering \emph{\small{Point Cloud}}}}
       & PointNet++~\citep{qi2017pointnet++} & 82.9\% & 0.781 & 29.7 & 2.33 \\
       && FoldingNet~\citep{yang2018foldingnet} & 84.6\% & 0.807 & 19.2 & 0.05 \\
       && TopNet~\citep{tchapmi2019topnet} & 85.1\% & 0.822 & 79.9 & 0.61 \\ 
       && AtlasNet~\citep{Groueix_2018_CVPR} & 85.2\% & 0.821 & 2.0 & 0.32  \\
       && PCN~\citep{yuan2018pcn} & 87.3\% & 0.823 & 54.8 & 0.11 \\
       && MSN~\citep{liu2020morphing} & 88.0\% & 0.830 & 12.0 & 0.21 \\
       && SoftPoolNet~\citep{wang_softpool} & 89.2\% & 0.842 & 37.2 & 0.04 \\
       && GRNet~\citep{grnet_xie} & 91.6\% & 0.857 & 293.0 & 0.88  \\
	   && PointCNN~\citep{li2018pointcnn} & 87.6\% & 0.826 & 497.0 & 1.20  \\
       && PMP-Net~\citep{wen2020pmp} & 90.6\% & 0.850 & 22.0 & 4.21 \\
       && CRN~\citep{Wang_2020_CVPR} & 93.1\% & 0.884 & 61.5 & 2.73 \\
\cmidrule{2-7}	
       &\multirow{3}{*}%{\emph{imp.}}
       {\rotatebox[origin=c]{90}{\centering \emph{\small{Implicit}}}}
       & IF-Net~\citep{chibane20ifnet} & 93.6\% & 0.909 & 29.4 & 9.27 \\
       && Points2Surf~\citep{erler2020points2surf} & 94.3\% & 0.921 & 24.0 & 12.64 \\
       && \textbf{SecNet} & \textbf{97.6\%} & \textbf{0.971} & 9.7 & 0.19 \\
    %   && \textbf{SecNet} \without{LD} & 97.2\% & 0.964 \\
\bottomrule
\end{tabular}
}

\caption{Evaluation of the gaze direction classification and estimation with the corresponding model size and inference time. 
The table is divided into two regions. The methods on top directly use the depth image to find the gaze, 
while the methods at the bottom estimate the gaze based on the semantic eye completion. 
%The classification accuracy is based on the nine gaze directions; and, the cosine distance is the dot product of the gazing direction between the estimated and ground truth. 
Note that EMGE~\citep{zhou20163d} and RTGE~\citep{sun2015real} do not depend on a parameterized inference model. 
%
% \emph{scan} report approaches which do not complete the eye. Performance of completion approaches are reported in 3 types including \emph{vol.} (volumetric data), \emph{p.c.} (point cloud) and \emph{imp.} implicit reconstruction. \emph{accuracy} means the gazing direction classification results for 9 directions. \emph{cosine distance} means the cosine distance calculated for gazing estimation where gazing direction is with minor changes when the general gazing direction is straight-forward.
% \djt{What does scan mean? Why is 3D-GCN there?}
\label{tab:gaze}
% \vspace{-5pt}
}
% \vspace{10pt}
\end{table}

% \begin{table}[!t]
% \centering
% %\resizebox{\linewidth}{!}
% %{
% \begin{tabular}{l|l|c|c}
% 	\toprule	
% 		\multicolumn{1}{c}{} & \multicolumn{1}{c}{Method}
% 		 & \multicolumn{1}{c}{Size (MB)} & \multicolumn{1}{c}{Time (seconds)}  \\
% 		%\multicolumn{1}{c}{} 
% 		%& (MB) & Time~(s) & Surface & Data\\
% 	\midrule 
%     \multirow{2}{*}{\rotatebox[origin=c]{90}{\centering \emph{Voxel}}}	
%     & 3D-EPN~\citep{dai2017shape} & 420.0 & 0.82 \\
%     & ForkNet~\citep{wang2019forknet} & 362.0 & 1.12 \\
% 	\midrule
%       \multirow{7}{*}{\rotatebox[origin=c]{90}{\centering \emph{Point Cloud}}}
% 		& GRNet~\citep{grnet_xie} & 293.0 & 0.88  \\
% 		& PointCNN~\citep{li2018pointcnn} & 497.0 & 1.20  \\
% 		& FoldingNet~\citep{yang2018foldingnet} & 19.2 & 0.05 \\
% 	    & AtlasNet~\citep{Groueix_2018_CVPR} & 2.0 & 0.32  \\
% 		& PCN~\citep{yuan2018pcn} & 54.8 & 0.11 \\
% 		& MSN~\citep{liu2020morphing} & 12.0 & 0.21 \\
% 		& SoftPoolNet~\citep{wang_softpool} & 37.2 & 0.04 \\
% 	\midrule
%       \multirow{3}{*}{\rotatebox[origin=c]{90}{\centering \emph{Implicit}}}
% 		& DeepSDF~\citep{park2019deepsdf} & 7.4 & 9.72 \\
% 	    & Points2Surf~\citep{erler2020points2surf} & 24.0 & 12.64 \\
% 	    &$\textbf{SecNet}$ & 9.7 & 0.19 \\
%         % & $\textbf{SecNet}$ \without{LD} & 10.1 & 1.72 \\
% 	\bottomrule
% 	\end{tabular}
% %}
% 	% \setlength{\belowcaptionskip}{\RemoveBelowCaption}
% \caption{Overview of different 3D completion methods.
% % \emph{size} is present in MB, \emph{time} is presented i n seconds.
% %Our average is calculated without door category. \fede{remove door} 
% \label{tab:param}
% }
% \end{table}

Since we can convert the semantic eye completion to a gaze direction through \secref{sec:gaze}, this section focuses on the evaluation of the gaze direction. In addition to the gaze from semantic completion, we also include related work that is designed for gaze estimation, such as EMGE~\citep{zhou20163d} and RTGE~\citep{sun2015real}. These methods directly locate the pupil and estimate the center of the eyeball from the 2D partial scan without eye completion.
We also include 3D-GCN~\citep{lin2020convolution} which segments the input partial scan into parts prior to the gaze estimation.

We first treat gaze estimation as a classification problem in which the estimated gaze is matched to one of the nine gaze directions. 
Compared with the other methods in \tabref{tab:gaze}, our approach reaches a classification accuracy of 97.6\%, which is the highest among all compared approaches.  

Instead of relying only on discrete classes, we also evaluate the cosine distance between the estimated gaze and the ground truth, computed as the dot product of the two unit direction vectors, so that a value of 1 indicates perfect alignment. Here, our approach also produces the best performance of 0.971.
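For concreteness, this score can be written as follows, where $\hat{\mathbf{g}}$ (the estimated gaze vector) and $\mathbf{g}$ (the ground-truth gaze vector) are notation introduced here purely for illustration:
\begin{equation*}
s(\hat{\mathbf{g}}, \mathbf{g}) = \frac{\hat{\mathbf{g}}^{\top} \mathbf{g}}{\|\hat{\mathbf{g}}\|_2 \, \|\mathbf{g}\|_2} \in [-1, 1],
\end{equation*}
which reduces to the plain dot product $\hat{\mathbf{g}}^{\top} \mathbf{g}$ when both vectors have unit length. A score of $s = 1$ corresponds to perfectly aligned directions, so higher values are better.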


% Since our dataset includes the nine gaze directions, we can then experiment as the classification of the 9-categorical directions based on \secref{sec:gaze}.

% As we have synthetically rendered the training and validation dataset in 9 gaze directions, the 9-categorical classification experiments are carried out on the semantically completed 3D shape of eye regions. 
% We qualitatively demonstrate the reconstructed 9 gaze directions for the same identity in \figref{fig:gazings}. The 9 category classification \emph{accuracy} in \tabref{tab:gaze} is reported by taking semantic eye surface from each approach as input for a PointNet~\citep{qi2017pointnet} classifier. Served as baseline comparisons, EMGE~\citep{zhou20163d} and RTGE~\citep{sun2015real} directly locates the pupil and estimate center of eyeball from 2D partial scan without eye completion. Also not using geometric completion, 3D-GCN~\citep{lin2020convolution} locates the center of pupil and eyeball by point cloud segmentations with the help of point cloud convolution which achieves an accuracy of 61.8\%. Performance of completion approaches are reported in 3 types including \emph{vol.} (volumetric data), \emph{p.c.} (point cloud) and \emph{imp.} implicit reconstruction. We further report gazing direction estimation with minor changes when the general gazing direction is straight-forward in \emph{cosine distance} between ground truth direction and estimated results, where our approach produces the best performance of 0.964.

% \djt{Are there qualitative comparisons?} \yida{currently no}

\subsection{Efficiency}



% \djt{Ablation study on the parameters.}

We also evaluate the processing time and the corresponding memory footprint of each model, which are summarized in \tabref{tab:gaze}. 
The table shows that our inference time of 0.19 seconds is much shorter than that of the other implicit reconstruction methods, such as DeepSDF~\citep{park2019deepsdf} at 9.72 seconds and Points2Surf~\citep{erler2020points2surf} at 12.64 seconds. 
This is because the other methods rely on zero-crossing during reconstruction, whereas our method does not. 
%
% \yida{
In addition, our point-wise implicit estimation is conditioned on a SoftPool feature from the encoder, which is processed once for each partial scan. 
Points2Surf, on the other hand, decodes the implicit values using QSTN~\citep{guerrero2018pcpnet}, which depends on analyzing local point cloud patches. This implies that it needs to be executed repeatedly, once for each query point.
% } 
%
As a consequence, we can further reduce the inference time by focusing on the areas surrounding the partial scan, which decreases the number of query points to process. 
%
Overall, we attain neither the lowest memory footprint nor the lowest inference time. However, we argue that our approach offers a good trade-off between the two.
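
The dependence on the number of query points can be illustrated with a rough cost model. This is only a sketch: the symbols $N_q$ (number of query points), $t_\text{enc}$ (one-off encoding time of the partial scan), $t_\text{dec}$ (per-point decoding time) and $t_\text{patch}$ (per-query local patch analysis time) are hypothetical quantities introduced here and were not measured separately. Under these assumptions, the inference time of a decoder conditioned on a single global feature and that of a patch-based decoder scale approximately as
\begin{equation*}
T_\text{global} \approx t_\text{enc} + N_q \, t_\text{dec},
\qquad
T_\text{patch} \approx N_q \left( t_\text{patch} + t_\text{dec} \right).
\end{equation*}
In both cases the cost grows linearly with $N_q$, so restricting the queries to the neighborhood of the partial scan lowers the inference time, while the encoding cost $t_\text{enc}$ is paid only once per scan.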

% \tabref{tab:param} lists the size of the model and inference time for each single scan during completion. Notice that our inference time is 0.19 seconds, which is much faster than other implicit reconstruction approaches such as DeepSDF~\citep{park2019deepsdf} in 9.72 seconds and Points2Surf~\citep{erler2020points2surf} in 12.64 seconds. Our approach only focuses on areas surrounding the partial scan so that number of query points could be much smaller than other implicit reconstruction approaches.



\subsection{Ablation study}

% \djt{intro}
% We add an ablation study that 

We perform an ablation study to understand the effect of changes in the hyperparameters. \tabref{tab:architecture_ablation} summarizes this study.
%
We observe that the configuration of the encoder MLP has little effect on the completion performance once the output dimensions of its first two layers reach 512. Increasing either of them to 1024 only improves the average Chamfer distance from $6.15 \times 10^{-3}$ to, at best, $6.09 \times 10^{-3}$. 

Since our latent feature is extracted by SoftPool~\citep{wang_softpool} operators, we evaluate how the performance changes with different input feature dimensions $N_f$ and numbers of points $N_r$ chosen from each of the sequential feature maps.  
Our experimental results show that the Chamfer distance ranges from $6.10 \times 10^{-3}$ to $6.84 \times 10^{-3}$ as long as the feature dimension $N_f$ is larger than 4, whereas it degrades to $8.07 \times 10^{-3}$ for $N_f = 4$. 
This is consistent with the findings of \citet{wang_softpool}, where good performance is reached as long as $N_f$ is larger than 4. 
%
For the MLP in the decoder, we found that the performance saturates at $D_\text{out} = [16, 32, 64, 5]$; wider layers improve the Chamfer distance only marginally.
 
 
% \tabref{tab:architecture_ablation} shows that changing hyper parameters in encoder MLP does not change the completion performance obviously as long as the output dimension of the fisrt layer of MLP is layer than 256. The completion performance is improved from average Chamfer distance of $6.15 \times 10^{-3}$ to $6.09 \times 10^{-3}$. 
% As our latent feature is extracted by SoftPool~\citep{wang_softpool} operators, we validate the performance changes by adopting different input feature dimension $N_f$ and number of points $N_r$ chosen from each sequential feature map ordered by SoftPool~\citep{wang_softpool}. Experimental results show that performance ranges between $6.10 \times 10^{-3}$ to $6.84 \times 10^{-3}$. as long as feature dimension $N_f$ is larger than 4. 
% In terms of the MLP in decoder, we found that performance is saturated once $D_\text{out} = [16, 32, 64, 5]$.

\begin{table}[t]
\centering
%\resizebox{0.9\linewidth}{!}
%{
\begin{tabular}{l|l|c}
% \multicolumn{2}{c}{SecNet \small{(Transformer)}} 
\toprule	
\multicolumn{1}{c}{Modules} & \multicolumn{1}{c}{Parameters} & \multicolumn{1}{c}{Chamfer Distance ($\times 10^{-3}$)} \\
\midrule
 & Gaussian~\citep{tancik2020fourier}  &6.42 \\
 positional coding & SIREN~\citep{sitzmann2019siren} & \textbf{6.15} \\
 & sinusoidal~\citep{vaswani2017attention} & 6.31 \\
\midrule
 & $D_\text{out} = [512, 256, 8]$ & 6.99 \\
 & $D_\text{out} = [256, 512, 8]$ & 7.33 \\
MLP (encoder) & $D_\text{out} = [512, 512, 8]$ & \textbf{6.15} \\
 & $D_\text{out} = [1024, 512, 8]$ & 6.12 \\
 & $D_\text{out} = [512, 1024, 8]$ & 6.09 \\
\midrule
 & $N_r$ = 16, $N_f$ = 8 & 6.84 \\
 & $N_r$ = 32, $N_f$ = 4 & 8.07 \\
SoftPool~\citep{wang_softpool} & $N_r$ = 32, $N_f$ = 8 & \textbf{6.15} \\
 & $N_r$ = 32, $N_f$ = 16 & 6.10 \\
 & $N_r$ = 64, $N_f$ = 8 & 6.14 \\
\midrule
 & $D_\text{out} = [16, 16, 64, 5]$ & 7.04 \\
 & $D_\text{out} = [16, 32, 32, 5]$ & 6.60 \\
MLP (decoder) & $D_\text{out} = [16, 32, 64, 5]$ & \textbf{6.15} \\
 & $D_\text{out} = [16, 64, 64, 5]$ & 6.13 \\
 & $D_\text{out} = [16, 32, 128, 5]$ & 6.11 \\
\bottomrule
\end{tabular}
%}

\caption{Ablation study on network hyperparameters. Results in bold indicate the parameters chosen for our architecture, which balance accuracy and model size. 
\label{tab:architecture_ablation}}
\end{table}

\section{Conclusion}

In this paper, we propose to complete the eye region through a \emph{semantic implicit field}. 
%
Using our semantic eye completion, we also introduce a more practical use-case, \ie gaze estimation.
%
We achieve state-of-the-art performance for both semantic eye completion and gaze estimation. % \yida{changed direction classification into estimation, while I have done both classification and detailed estimation}
% 
Since we introduce a new problem in semantic completion and a new type of solution for gaze estimation, we also present a simple way to build a dataset for semantic eye completion based on UnityEyes~\citep{wood2016learning} meshes, which we use to train and evaluate the models.


% \acks{Acknowledgements go here.}

\newpage
\bibliography{pmlr-sample}

% \appendix



% \section{Second Appendix}\label{apd:second}

% This is the second appendix.

\end{document}
