Abstract: Highlights
• Design a vision-language multimodal attention-based model, VLCDoC, for document analysis.
• InterMCA and IntraMSA attention modules effectively align the cross-modal features.
• Multimodal contrastive pretraining is proposed to learn vision-language features (a minimal sketch follows below).
• The strong generality of the learned multimodal domain-agnostic features is demonstrated.
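The paper's exact contrastive formulation is not reproduced in these highlights; the following is a minimal sketch assuming a standard symmetric InfoNCE objective between paired vision and language embeddings. The names vision_feats, text_feats, and temperature are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(vision_feats: torch.Tensor,
                     text_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss; row i of each tensor is a matched pair.

    A generic sketch of vision-language contrastive pretraining,
    not the paper's exact loss.
    """
    # L2-normalize so dot products become cosine similarities.
    v = F.normalize(vision_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    # Pairwise similarity matrix scaled by temperature: (batch, batch).
    logits = v @ t.T / temperature
    # Matched vision-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the vision-to-text and text-to-vision cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Example: embeddings from a document image encoder and a text encoder.
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```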