Abstract: Object detection is one of the central tasks in computer vision, with many challenges and practical applications. The goal is to classify and localize objects in images or videos. Many machine learning methods, especially deep learning methods, have been developed for this task. This article introduces a model that combines DETR (DEtection TRansformer) and ViT (Vision Transformer) to detect objects in images and videos using only components of the Transformer model. DETR achieves good object detection results with the Transformer architecture and without complex intermediate steps. ViT, a Transformer-based architecture, has brought about a breakthrough in image classification. Combining the two architectures opens promising prospects in computer vision. Features are extracted automatically from the input image by a ViT model pretrained on the ImageNet21K dataset, and these features are then fed into the Transformer model to predict the class and bounding box of each object. Experimental results on test datasets show that the combined model recognizes objects better than DETR or ViT alone. This suggests that the Transformer model can be applied not only in natural language processing but also in image classification and object detection. The proposed model achieves mAP@0.5 = 0.444, slightly better than the original DETR model. The code is available at https://github.com/nguyendung622/vitdetr.
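For readers unfamiliar with the reported metric, the "@0.5" in mAP@0.5 refers to an intersection-over-union (IoU) threshold of 0.5: a predicted box counts as a correct detection only if it has the right class and overlaps a ground-truth box with IoU of at least 0.5. The following is a minimal sketch of that overlap test, not the evaluation code used in the article. The corner box format and the names `iou` and `is_true_positive` are assumptions chosen for illustration, and a full mAP computation additionally ranks predictions by confidence, matches each ground-truth box at most once, and averages precision over recall levels and classes.

```python
from typing import Sequence

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box: Box, pred_label: str,
                     gt_box: Box, gt_label: str,
                     threshold: float = 0.5) -> bool:
    """Hypothetical helper: single-pair match rule at the given IoU threshold."""
    return pred_label == gt_label and iou(pred_box, gt_box) >= threshold
```

Under this rule, a prediction shifted by 2 units against a 10 by 10 ground-truth box (IoU of 80/120) is accepted, while one shifted by 5 units (IoU of 50/150) is rejected.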