Keywords: Face Recognition, Computer Vision
Abstract: Vision Transformers (ViTs) are gaining popularity for a range of tasks beyond image classification, including face recognition (FR). ViTs split an input image into patches and apply self-attention, which lets patches interact and capture both local and global relationships. However, standard ViTs lack strong inductive biases, such as spatial priors, which makes it difficult to learn fine-grained local features and coarse global structural patterns efficiently and ultimately limits performance. To address this limitation, we propose injecting global semantic information that gives the model a holistic signal to guide the learning of spatial relationships. Specifically, we introduce a Global Context Token (GCT) into the ViT architecture for FR. The GCT is a learnable token appended to the input patch sequence; it interacts with all patch tokens through self-attention, providing complementary global context and enhancing the discriminative power of the resulting context-aware representations. We show empirically that a ViT with the GCT outperforms the vanilla ViT for FR on all considered benchmarks. Our analysis of attention maps and patch-wise discriminative ability indicates that the GCT directs attention toward the eye regions, which are widely recognized as the most discriminative facial areas for FR, whereas other configurations exhibit more evenly distributed attention. Compared to previous ViT-based FR works, our approach achieves state-of-the-art results when trained on datasets such as MS1MV2 and WebFace4M, ranking first among ViT-based models on the IJB-B and IJB-C benchmarks. These findings highlight the effectiveness of the GCT in enriching global representations and improving FR robustness.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12153
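The mechanism described in the abstract, a learnable token appended to the patch sequence that interacts with all patch tokens through self-attention, can be illustrated with a minimal sketch. This is not the authors' implementation: the single attention head, the identity query/key/value projections, the zero initialization of the token, and the names `append_global_token` and `self_attention` are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections, so each token is used
    directly as its own query, key, and value.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[d] for w, v in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


def append_global_token(patch_tokens, global_token):
    """Append the (hypothetical) global context token to the patch sequence."""
    return patch_tokens + [global_token]


random.seed(0)
num_patches, dim = 4, 8
patch_tokens = [
    [random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_patches)
]

# Stand-in for a learnable parameter; zero initialization is an assumption.
global_token = [0.0] * dim

sequence = append_global_token(patch_tokens, global_token)
outputs, weights = self_attention(sequence)

# Last row: how the global token attends to every token in the sequence.
gct_attention = weights[-1]
gct_output = outputs[-1]
```

With the assumed zero initialization, the global token's query is the zero vector, so it attends uniformly to every token and its output is the mean of the sequence. In a trained model the token would be updated by gradient descent and its attention pattern would no longer be uniform.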