Abstract: Highlights
• Develop a binding learning mechanism to facilitate cross-modal feature interaction.
• Introduce an optimized transformer structure to filter out inconsistent noise.
• Introduce CLS and PE feature vectors in the modality fusion process.