Reproducibility Report RetinaFace: Single-shot Multi-level Face Localization in the Wild

31 Jan 2021 (modified: 05 May 2023) · ML Reproducibility Challenge 2020 Blind Submission
Keywords: Face Detection, Landmark Localization, Vision
Abstract: Scope of Reproducibility: RetinaFace is a deep learning model that detects faces in images by proposing rectangular areas (bounding boxes) for every single face. Unlike the other current state-of-the-art models, this study proposes a multi-task loss calculation by also computing the coordinates of 5 facial landmarks (eyes, nose, and two sides of the mouth) and 3D face mesh with 1000 points concurrently. Additionally, the proposed model also adapts a cascaded structure and deformable convolution layers (DCL). The scope of this paper includes the whole model structure excluding DCL. Additionally, The tasks implemented are limited only to face bounding box detection and landmark localization tasks, since the 3D point detection database is not publicly shared. Methodology: For this challenge, I implemented this model in Julia programming language, by using the Knet deep learning framework. The whole model is implemented from scratch. There are official and unofficial implementations are available but these codes only contain a subset of the whole model proposed in the paper. In the context module part and for constructing the methods related to box proposal, these repositories are taken as examples. For training, the WIDER FACE database is preferred and as landmark data, custom annotations created by the original paper's authors are used. Model is trained in one Tesla V100 GPU with a batch size of 10 for 60 epochs, which lasted approximately 9 days. Results: The average precision (AP) metric is used for evaluation and the results are 0.093 lower in the Easy, 0.076 in the Medium, and 0.129 in the Hard subsets of WIDER FACE. Possible reasons for this performance difference are discussed in the Limitations & Problems section. What was easy: Since the model only uses a small set of operations (convolution, batch normalization, unpooling, softmax, and ReLU). Therefore implementing the whole model was easy except for the loss calculation part. What was difficult: The selection process of which box proposals are for faces and which are for background and how to balance their losses were not explained in the original paper in detail. Because of these obscurities, implementing the loss calculation was difficult. Communication with original authors: I contacted them to request access to the 3D face points database but learned that that data belongs to a start-up company and is not publicly licensed.
Paper Url: https://openreview.net/forum?id=sOAVri2xt5&noteId=BAh_YZjiidS
Supplementary Material: zip