Self-Attention with Convolution and Deconvolution for Efficient Eye Gaze Estimation from a Full Face Image

Abstract: This paper proposes a new full-face-image-based eye gaze estimation network that addresses low generalization performance. Because of the high variance in facial appearance and environmental conditions, conventional gaze estimation methods generalize poorly and easily overfit to the training subjects. To solve this problem, we adopt a self-attention mechanism, which offers better generalization performance. However, applying self-attention directly to an image incurs a high computational cost. We therefore introduce a new projection that applies convolution to the entire face image to accurately model the local context and reduce the computational cost of self-attention. The proposed model also includes a deconvolution that transforms the down-sampled global context back to the same size as the input so that spatial information is not lost. Experiments show that the proposed method achieves state-of-the-art results on the EYEDIAP, MPIIFaceGaze, Gaze360, and RT-GENE datasets, improving on prior state-of-the-art models by 0.02° to 0.30°. In addition, we demonstrate the generalization performance of the proposed model through a cross-dataset evaluation.
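To illustrate the core idea described above, the following NumPy sketch (our own construction; the paper's actual layer choices, kernel sizes, and learned weights are not specified here) shows how a strided convolutional projection can shrink the key/value token set before self-attention, reducing the attention matrix from (HW)² entries to HW·(HW/s²), and how a deconvolution-style upsampling restores the input resolution. Average pooling and nearest-neighbour repetition stand in for the learned strided convolution and transposed convolution.

```python
import numpy as np

def strided_conv_proj(x, stride=2):
    # Stand-in for a learned strided convolution: average-pool
    # x of shape (H, W, C) down to (H // stride, W // stride, C).
    H, W, C = x.shape
    s = stride
    return x[:H - H % s, :W - W % s].reshape(H // s, s, W // s, s, C).mean(axis=(1, 3))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conv_attention(x, stride=2):
    # Queries stay at full resolution; keys/values come from the
    # down-sampled convolutional projection, so the attention matrix
    # has HW * (HW / stride^2) entries instead of (HW)^2.
    H, W, C = x.shape
    q = x.reshape(H * W, C)
    kv = strided_conv_proj(x, stride).reshape(-1, C)
    attn = softmax(q @ kv.T / np.sqrt(C))  # (HW, HW / stride^2)
    return (attn @ kv).reshape(H, W, C)

def deconv_upsample(x, stride=2):
    # Stand-in for a learned transposed convolution: nearest-neighbour
    # repetition that brings a down-sampled map back to input size.
    return np.repeat(np.repeat(x, stride, axis=0), stride, axis=1)

# Example: an 8x8 feature map with 16 channels keeps its spatial size
# through the attention block, while attending over only 4x4 tokens.
x = np.random.rand(8, 8, 16)
out = conv_attention(x)            # (8, 8, 16)
restored = deconv_upsample(strided_conv_proj(x))  # (8, 8, 16)
```

With stride 2, the attention here scores 64 queries against only 16 key/value tokens (a 4× reduction), which is the source of the computational saving the abstract refers to.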