Abstract: Highlights•A two-stage fusion method is proposed, which combines GNN and transformers.•Taking full advantage of two modalities information by two ways in the second stage.•Vision-language alignment vector is employed to enhance multimodal fusion.•Good performance on both full dataset and dataset with fewer samples.