Abstract: Highlights
• Propose ITF-WPI, a cross-modal image and text feature fusion model for identifying 17 pests common to wolfberry.
• Introduce a contextual Transformer network and the Pyramid Squeezed Attention (PSA) mechanism into the model for visual recognition.
• A CNN-LSTM-like model (convolutional neural network with long short-term memory), built by stacking 1D convolutional and bidirectional long short-term memory (BiLSTM) networks, achieved competitive performance.
• Construct an image and text dataset for wolfberry pest identification scenarios. Each text describes the pest shown in the images and contains the scientific name profile, source distribution, habitat, and control methods.