Abstract: Highlights
• GK-MVLP used fine-grained visual–knowledge alignment for representation learning.
• Knowledge prompts enhanced localization and prevented irrelevant information.
• GK-MVLP exceeded SOTA in classification, localization, report generation, and VQA.