VLCDoC: Vision-Language contrastive pre-training model for cross-Modal document classification

Published: 01 Jan 2023, Last Modified: 13 Nov 2024. Pattern Recognit. 2023. License: CC BY-SA 4.0
Abstract: Highlights
• Design of VLCDoC, a vision-language, attention-based multimodal model for document analysis.
• The InterMCA and IntraMSA attention modules effectively align cross-modal features.
• A multimodal contrastive pretraining objective is proposed to learn joint vision-language features (a generic sketch of such an objective is given below).
• The learned multimodal, domain-agnostic features are shown to generalize well.
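For orientation, here is a minimal sketch of a symmetric cross-modal contrastive (InfoNCE-style) loss of the kind the highlights describe. The function name, tensor shapes, and temperature value are illustrative assumptions, not the paper's verified implementation:

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(vision_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over paired vision/language embeddings.

    Illustrative sketch only: names, shapes, and the exact formulation are
    assumptions; VLCDoC's actual objective may differ in detail.
    vision_emb, text_emb: (batch, dim) tensors of paired document features.
    """
    # L2-normalize so dot products become cosine similarities.
    v = F.normalize(vision_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; the diagonal holds the positive pairs.
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # Contrast in both directions: vision-to-text and text-to-vision.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return (loss_v2t + loss_t2v) / 2
```

Under this kind of objective, matched vision-language pairs are pulled together in the shared embedding space while mismatched pairs within the batch are pushed apart.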