Abstract: Face recognition accuracy is now comparable to human performance, largely driven by improvements in face embedding models. However, these models rely on high-quality images, which makes them vulnerable to variations in image quality, illumination, pose, and occlusion in real-world scenarios. To address this, we propose a Transformer-based aggregation method for face verification that leverages multiple embeddings to improve robustness under adverse conditions. The model uses a Transformer encoder to refine frame-level embeddings computed by a high-performance single-frame feature extractor, and it is trained with an adaptive triplet loss that applies separate adaptive margins to positive and negative pairs. Training follows a two-phase strategy: pretraining with average pooling alignment, then fine-tuning with the adaptive triplet loss. We train the model on the YouTube Faces dataset and evaluate it on unseen data from IMFDB, CASIA-WebFace, CCVID, IJB-B, and IJB-C, where it outperforms the baseline on most datasets, especially IMFDB, which has significant variations in pose, lighting, and scene context. Our approach is an effective way to build upon a single-frame face embedding model, leveraging multiple images for accurate face verification under adverse conditions.
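To make the two ingredients named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It shows (a) the average-pooling baseline that aggregates several frame-level embeddings into one, and (b) a triplet-style loss with separate margins for the positive and negative pair. Several things here are assumptions, since the abstract does not specify them: the use of cosine distance, the exact loss form, and the function names `average_pool` and `two_margin_triplet_loss`. The margins `m_pos` and `m_neg` are passed in as fixed numbers; the rule that makes them adaptive is not described in the abstract and is not reproduced here. The Transformer encoder itself is omitted.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def average_pool(frames):
    """Baseline aggregation: mean of unit-length frame embeddings, renormalized."""
    normed = [l2_normalize(f) for f in frames]
    dim = len(normed[0])
    mean = [sum(f[i] for f in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)

def cosine_distance(a, b):
    """1 - cosine similarity; an assumed metric, not confirmed by the abstract."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))

def two_margin_triplet_loss(anchor, positive, negative, m_pos, m_neg):
    """Hypothetical loss form with separate margins for each pair type.

    Penalizes a positive pair that is farther apart than m_pos and a
    negative pair that is closer than m_neg. In the paper both margins
    are adaptive; here they are plain arguments.
    """
    d_ap = cosine_distance(anchor, positive)
    d_an = cosine_distance(anchor, negative)
    return max(0.0, d_ap - m_pos) + max(0.0, m_neg - d_an)

# Toy 2-D "tracks": each inner list stands in for one frame-level embedding.
track_a = [[1.0, 0.0], [0.8, 0.6]]   # identity 1, first video
track_b = [[0.6, 0.8], [1.0, 0.0]]   # identity 1, second video
track_c = [[0.0, 1.0], [-0.6, 0.8]]  # identity 2

emb_a, emb_b, emb_c = (average_pool(t) for t in (track_a, track_b, track_c))
loss = two_margin_triplet_loss(emb_a, emb_b, emb_c, m_pos=0.2, m_neg=0.8)
```

In this sketch a learned aggregator would replace `average_pool`, and the pooled embedding could serve as the alignment target during a pretraining phase; both of those readings are interpretations of the abstract, not details it states.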
External IDs:dblp:journals/access/MoonrintaDE25