Abstract: Highlights
• Rich semantics can be developed in representation learning by leveraging.
• The inter-data relationships can be modeled using external knowledge.
• Additional cross-modal alignment leads to better visual understanding.
• Fine-grained inter-data similarity can serve as soft targets for cross-modal alignment.
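The last highlight, using fine-grained inter-data similarity as soft targets for cross-modal alignment, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the softmax temperatures `tau` and `tau_target`, and the assumption that an inter-data similarity matrix is already available from some external knowledge source are all hypothetical choices made for illustration. The sketch replaces the one-hot targets of a standard image-to-text contrastive loss with a softmax over the similarity scores.

```python
import math


def softmax(row, temperature=1.0):
    # Numerically stable softmax over one row of scores.
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_alignment_loss(image_emb, text_emb, inter_data_sim,
                        tau=0.1, tau_target=0.1):
    """Image-to-text cross-entropy with soft targets (illustrative only).

    image_emb, text_emb: lists of equal-length vectors, one per sample.
    inter_data_sim: hypothetical n x n matrix of similarities between
        samples, assumed to come from external knowledge.
    tau, tau_target: assumed temperatures for predictions and targets.
    """
    n = len(image_emb)
    loss = 0.0
    for i in range(n):
        # Predicted distribution over all texts for image i.
        logits = [cosine(image_emb[i], t) for t in text_emb]
        pred = softmax(logits, tau)
        # Soft target: similar samples receive non-zero probability mass
        # instead of a one-hot label on the paired text only.
        target = softmax(inter_data_sim[i], tau_target)
        loss -= sum(t * math.log(p) for t, p in zip(target, pred))
    return loss / n
```

As `tau_target` shrinks, the soft targets approach one-hot labels whenever each sample is most similar to itself, so under that assumption the sketch falls back to an ordinary contrastive objective.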
External IDs: dblp:journals/cviu/WeiKC25