Abstract: Computer vision and natural language processing (NLP) are two of the most active research areas in machine learning. In recent years, the integration of these two areas has given rise to an interdisciplinary field that is attracting increasing attention from researchers. Transformer-based models, an approach that has achieved great success in NLP, are now widely used in computer vision tasks such as image classification and object detection. However, few studies have applied this technique to face recognition. Moreover, when deploying a deep learning model in real-world applications such as face recognition, it is difficult to collect as many samples as standard datasets provide, especially for very important persons. To simulate this practical condition, we build a new face dataset named VFD, containing 3,000 images of 1,000 famous Vietnamese people. To tackle face recognition with limited training samples, we adopt the baby learning approach to train the Vision Transformer (ViT): the model is trained over successive iterations that gradually increase the number of parameters to improve generalization. Finally, we validate the method on our dataset under various face-task configurations to demonstrate the robustness of the model. The experiments show that our method achieves comparable results overall and outperforms CNN-based methods such as ResNet when data is scarce.
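To make the progressive-training idea concrete, below is a minimal PyTorch sketch of one way such a scheme could look: grow the ViT by transformer depth and warm-start each larger model from the previous stage. The growth-by-depth strategy, the helper names (`make_vit`, `train_stage`), the toy patch-embedding setup, and all hyperparameters are illustrative assumptions on our part, not the paper's exact procedure.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def make_vit(depth, dim=192, heads=3, patches=196, num_classes=1000):
    # Toy ViT encoder over precomputed patch embeddings
    # (no patchifier or CLS token; kept minimal on purpose).
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dim_feedforward=4 * dim, batch_first=True)
    return nn.Sequential(
        nn.TransformerEncoder(layer, num_layers=depth),
        nn.Flatten(),                         # (B, patches, dim) -> (B, patches*dim)
        nn.Linear(patches * dim, num_classes),
    )

def train_stage(model, loader, epochs=1, lr=1e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model

# Dummy data standing in for patch-embedded face crops: 196 patches x 192 dims.
x = torch.randn(64, 196, 192)
y = torch.randint(0, 1000, (64,))
loader = DataLoader(TensorDataset(x, y), batch_size=16)

# Progressive loop (as we interpret "baby learning"): train a small model,
# then grow it and warm-start the larger model from the previous stage.
prev = None
for depth in (2, 4, 6):
    model = make_vit(depth)
    if prev is not None:
        # strict=False copies only the parameters whose names and shapes match
        # (the first encoder blocks and the classifier head); new blocks stay
        # randomly initialized.
        model.load_state_dict(prev.state_dict(), strict=False)
    prev = train_stage(model, loader)
```

Growing by depth is only one possible realization; widening the embedding dimension or increasing patch resolution stage by stage would follow the same warm-start pattern.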