ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding. (arXiv:2209.08569v1 [cs.CV])
cs.CV updates on arXiv.org
Recent multimodal Transformers have improved performance on Visually Rich
Document Understanding (VrDU) tasks by incorporating visual and textual
information. However, existing approaches mainly focus on fine-grained elements
such as words and document image patches, making it hard for them to learn from
coarse-grained elements, including natural lexical units such as phrases and
salient visual regions such as prominent image regions. In this paper, we attach
more importance to coarse-grained elements, which contain high-density information
and consistent semantics and are valuable for document understanding. …