Abstract: Vision transformers (ViTs) have continuously achieved new milestones in computer vision.
A natural way to use ViTs in detection is to directly replace the CNN-based backbone with a transformer-based one, but at the price of a considerable computational burden when deployed on resource-limited edge devices.
A more promising direction is the DETR family, which eliminates the need for many hand-designed components in object detection but still cannot reach real-time speed in edge applications.
In this paper, we propose a novel hardware-efficient adaptive-thinning DETR (HeatDETR) that, for the first time, achieves high-speed and even real-time inference on multiple edge devices.
Specifically, our work makes three main contributions:
1) For strong detection performance, we introduce a backbone design principle based on the visual modeling process, which proceeds from locality to globality. Building on it, we propose a semantic-augmented module (SAM) in the backbone that uses the global modeling capability of self-attention to enhance low-level semantics, and an attention-based task-couple module (TCM) that reduces the conflict between the classification and regression tasks (a SAM sketch follows this list).
2) For on-device efficiency, we propose a scale-combined module (SCM) that transforms the multi-level detection process into a single-level one, eliminating multi-branch inference for higher hardware speed while maintaining detection performance (see the SCM sketch below). We then revisit the network architectures and operators used in ViT-based models and reparameterized CNNs, identify hardware-efficient designs, and introduce the basic HeatDETR structure (a reparameterization sketch is also given below).
3) With our device-adaptive model-thinning strategy, deployable end-to-end HeatDETR models for target devices can be generated efficiently (see the thinning sketch below).
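The abstract only states that SAM injects the global modeling capability of self-attention into low-level features; the concrete design below (module name aside, the token flattening, head count, and residual wiring are all our assumptions) is a minimal sketch of that idea, not the paper's actual implementation.

```python
# Hypothetical sketch of a semantic-augmented module (SAM): global
# self-attention applied to a low-level feature map. All design details
# beyond "self-attention enhances low-level semantics" are assumptions.
import torch
import torch.nn as nn

class SAM(nn.Module):
    """Enhance a low-level feature map with global self-attention."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> flatten spatial positions into a token sequence
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)           # (B, H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens) # global context
        tokens = self.norm(tokens + attended)           # residual + norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)

low_level = torch.randn(1, 64, 40, 40)   # e.g., an early backbone stage
enhanced = SAM(64)(low_level)            # same shape, globally contextualized
```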
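For SCM, the abstract says only that multi-level detection is folded into a single-level process to avoid multi-branch inference; the resize-and-fuse scheme below is one plausible realization under that description, with the target resolution and 1x1 fusion conv being our assumptions.

```python
# Hypothetical sketch of the scale-combined module (SCM): align a feature
# pyramid to one resolution and fuse it, so the head runs on a single branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCM(nn.Module):
    """Merge a multi-level feature pyramid into one single-level map."""
    def __init__(self, in_channels: int, num_levels: int):
        super().__init__()
        self.fuse = nn.Conv2d(in_channels * num_levels, in_channels, kernel_size=1)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # Resize every level to the middle level's resolution, then fuse.
        target = feats[len(feats) // 2].shape[-2:]
        aligned = [F.interpolate(f, size=target, mode="bilinear",
                                 align_corners=False) for f in feats]
        return self.fuse(torch.cat(aligned, dim=1))  # one branch for the head

pyramid = [torch.randn(1, 256, s, s) for s in (64, 32, 16)]
single = SCM(256, num_levels=3)(pyramid)   # -> (1, 256, 32, 32)
```

A single fused map removes the per-level head branches at inference time, which is the hardware-speed benefit the abstract attributes to SCM.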
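The abstract names reparameterized CNNs among the hardware-efficient designs it revisits. Below is a minimal RepVGG-style structural reparameterization sketch (not HeatDETR's actual block; batch-norm folding is omitted for brevity): a multi-branch training block collapses into a single 3x3 conv for inference.

```python
# RepVGG-style structural reparameterization: train with 3x3 + 1x1 + identity
# branches, deploy a single equivalent 3x3 conv (BN folding omitted).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        # Multi-branch form used during training.
        return F.relu(self.conv3(x) + self.conv1(x) + x)

    def reparameterize(self) -> nn.Conv2d:
        """Fold the 1x1 branch and the identity into the 3x3 kernel."""
        fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels,
                          3, padding=1)
        kernel = self.conv3.weight.detach().clone()
        kernel[:, :, 1:2, 1:2] += self.conv1.weight.detach()  # 1x1 at center tap
        for c in range(kernel.shape[0]):
            kernel[c, c, 1, 1] += 1.0                         # identity branch
        fused.weight.data = kernel
        fused.bias.data = (self.conv3.bias + self.conv1.bias).detach()
        return fused

# The fused conv reproduces the three-branch sum exactly (ReLU stays outside).
block, x = RepBlock(64), torch.randn(1, 64, 32, 32)
fused = block.reparameterize()
assert torch.allclose(block(x), F.relu(fused(x)), atol=1e-5)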
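The abstract gives no algorithmic detail on device-adaptive model thinning, so the sketch below assumes the simplest possible scheme: profile candidate model widths on the target device and keep the widest one within a latency budget. The `build_model` factory and the width-only search space are hypothetical.

```python
# Hypothetical sketch of device-adaptive model thinning: pick the largest
# model variant whose measured on-device latency fits a budget.
import time
import torch
import torch.nn as nn

def profile_latency(model: nn.Module, inp: torch.Tensor, runs: int = 50) -> float:
    """Average per-image latency in milliseconds on the current device."""
    model.eval()
    with torch.no_grad():
        for _ in range(5):                      # warm-up
            model(inp)
        start = time.perf_counter()
        for _ in range(runs):
            model(inp)
    return (time.perf_counter() - start) / runs * 1e3

def thin_for_device(build_model, widths, inp, budget_ms: float):
    """Keep the widest variant whose measured latency fits the budget."""
    best = None
    for w in sorted(widths):                    # try narrow to wide
        candidate = build_model(width=w)        # hypothetical model factory
        if profile_latency(candidate, inp) <= budget_ms:
            best = candidate
    return best

toy = lambda width: nn.Sequential(nn.Conv2d(3, width, 3, padding=1), nn.ReLU())
model = thin_for_device(toy, widths=[32, 64, 128],
                        inp=torch.randn(1, 3, 224, 224), budget_ms=5.0)
```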
Experiments on the MS COCO dataset show that HeatDETR outperforms current DETR-based methods by 0.3%–6.2% AP with a 5%–68% speedup on a single Tesla V100.
Real-time inference can even be achieved on extremely memory-constrained devices, e.g., a dual-core Intel Core i7 CPU.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Supplementary Material: zip
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning