Abstract: Highlights
• A multi-granularity feature fusion module is proposed to address the limitations of single-scale features.
• A two-stage classification scheme based on the Vision Transformer (ViT) is proposed to reduce background interference on predictions. The ViT model is used to separate the object from the background and to enlarge its details.
• Extensive experiments demonstrate the superiority of our model. The visualization results show that the two-stage classification accurately localizes objects and facilitates correct predictions.
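The localize-then-enlarge idea behind the two-stage classification can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a per-patch attention map is already available from a ViT (how it is extracted is not covered here), and it uses a simple relative threshold and nearest-neighbour resizing as stand-ins. The function names `attention_bbox` and `crop_and_enlarge`, the `patch` size of 16, and the `thresh_ratio` value are all hypothetical choices made for illustration.

```python
import numpy as np


def attention_bbox(attn, thresh_ratio=0.5):
    """Bounding box (top, left, bottom, right), in patch units, of the
    patches whose attention is at least thresh_ratio * max attention.

    Assumption: `attn` is a 2-D, non-negative per-patch attention map.
    """
    mask = attn >= thresh_ratio * attn.max()
    rows = np.flatnonzero(mask.any(axis=1))  # patch rows containing the object
    cols = np.flatnonzero(mask.any(axis=0))  # patch columns containing the object
    return int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1


def crop_and_enlarge(image, attn, patch=16, thresh_ratio=0.5):
    """Crop the attended region and resize it back to the input size.

    Assumption: image height and width equal attn.shape * patch.
    Nearest-neighbour resizing keeps the sketch dependency-free.
    """
    top, left, bottom, right = attention_bbox(attn, thresh_ratio)
    crop = image[top * patch:bottom * patch, left * patch:right * patch]
    h, w = image.shape[:2]
    ys = np.arange(h) * crop.shape[0] // h  # source row for each output row
    xs = np.arange(w) * crop.shape[1] // w  # source column for each output column
    return crop[ys][:, xs]


# Toy usage: a 64x64 image with a 4x4 attention map focused on the centre.
attn = np.zeros((4, 4))
attn[1:3, 1:3] = 1.0
image = np.arange(64 * 64 * 3).reshape(64, 64, 3)
enlarged = crop_and_enlarge(image, attn)  # second-stage input, same size as image
```

In a two-stage setup of this kind, the first pass would supply the attention map and the enlarged crop would be classified in a second pass, so that background patches contribute less to the final prediction.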