VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification. (arXiv:2205.12029v1 [cs.CV])
cs.CV updates on arXiv.org
Multimodal learning from document data has achieved great success lately, as
it allows semantically meaningful features to be pre-trained as a prior for
learnable downstream tasks. In this paper, we approach the document
classification problem by learning cross-modal representations through language
and vision cues, considering intra- and inter-modality relationships. Instead
of merging features from different modalities into a common representation
space, the proposed method exploits high-level interactions and learns relevant
semantic information from effective attention flows within and across
modalities. …
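The abstract describes contrastive pre-training of paired vision and language representations. As a rough illustration only (the paper's exact objective and attention mechanism are not shown in this excerpt), a symmetric InfoNCE-style contrastive loss over aligned image/text embedding pairs can be sketched as follows; the function name, temperature value, and numpy implementation are all assumptions for illustration, not VLCDoC's actual code:

```python
import numpy as np

def contrastive_loss(vision_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched vision/text pairs (same row
    index) are pulled together, mismatched pairs pushed apart.
    Generic sketch, NOT the VLCDoC paper's exact objective."""
    # L2-normalize each embedding so the dot product is cosine similarity
    v = vision_emb / np.linalg.norm(vision_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (N, N) pairwise similarity matrix
    labels = np.arange(len(v))      # row i matches column i

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average of vision->text and text->vision directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

# Usage: aligned pairs should score a lower loss than misaligned ones
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
loss_aligned = contrastive_loss(v, v)        # perfectly matched pairs
loss_shuffled = contrastive_loss(v, v[::-1]) # text rows shuffled
```

The symmetric (two-direction) form is the common choice for vision-language contrastive pre-training, since it supervises both modalities' encoders with the same batch of pairs.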