Web: http://arxiv.org/abs/2108.09717

Jan. 31, 2022, 2:10 a.m. | Arka Ujjal Dey, Ernest Valveny, Gaurav Harit

cs.CV updates on arXiv.org

The open-ended question answering task of Text-VQA requires reading and
reasoning about local, often previously unseen, scene-text content of an image.
We address the zero-shot nature of this problem by proposing the generalized
use of external knowledge to augment our understanding of the scene text.
We design a framework to extract, validate, and reason with knowledge using a
standard multimodal transformer for vision-language understanding tasks.
Through empirical evidence and qualitative results, we demonstrate how external
knowledge can highlight instance-only …


