Enhancing Two-Stage Object Detection in Aerial Imagery with a Compact Convolutional Transformer RoI Head
Track: Type A (Regular Papers)
Keywords: Object Detection, Two-stage Detector, Convolutional Transformer, Aerial Images
Abstract: Detecting small objects in high-resolution aerial imagery remains challenging because of extreme class imbalance, dense object layouts, and the information lost to aggressive downsampling in standard detection pipelines. In this paper, we introduce CCTdeT (Compact Convolutional Transformer Detector), a two-stage detector that enhances Faster R-CNN by replacing its conventional Region of Interest (RoI) head with a Compact Convolutional Transformer (CCT). The CCT module preserves spatial detail with a lightweight convolutional tokenizer, captures global context through transformer encoder layers, and performs classification and bounding-box regression through attention-based sequence pooling. This hybrid design retains the localization precision of two-stage frameworks while improving the feature representation of small objects. On the VisDrone benchmark, CCTdeT achieves a mean Average Precision (mAP) of 0.276 and an mAP@0.50 of 0.510, outperforming the Faster R-CNN baseline. It also reduces computational cost by about 29% (373.4 → 264.9 GFLOPs) and raises inference speed from 3 to 4 FPS. Per-class analyses support these results and indicate that modernizing the RoI head with transformer-based modules can yield substantial gains in small-object detection.
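To make the attention-based sequence pooling step concrete, here is a minimal sketch in plain Python. It assumes the usual CCT formulation, in which a learned linear layer scores each output token, a softmax over the tokens turns the scores into weights, and the weighted sum of tokens is the pooled feature passed to the classification and regression branches. The function name `seq_pool`, the weight vector `w`, and the toy tokens are hypothetical and are not taken from the paper.

```python
import math

def seq_pool(tokens, w, b=0.0):
    """Pool a sequence of d-dimensional tokens into one d-dimensional vector.

    tokens: list of n token vectors, each a list of d floats
    w, b:   hypothetical learned weights/bias of the scoring linear layer
    """
    # One scalar score per token (linear layer with a single output unit).
    scores = [sum(wi * xi for wi, xi in zip(w, t)) + b for t in tokens]
    # Numerically stable softmax over the token dimension.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of tokens gives the pooled representation.
    d = len(tokens[0])
    pooled = [sum(a * t[j] for a, t in zip(weights, tokens)) for j in range(d)]
    return pooled, weights

# Toy example: three 2-d tokens; zero scoring weights give uniform attention.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = seq_pool(tokens, w=[0.0, 0.0])
```

With zero scoring weights the pooling reduces to a plain average of the tokens. In a trained head the weights would instead emphasize the tokens most informative for the RoI.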
Serve As Reviewer: ~Mirela_Popa1
Submission Number: 81