Web: http://arxiv.org/abs/2209.10326

Sept. 22, 2022, 1:14 a.m. | Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, Jie Chen

cs.CV updates on arXiv.org

Text-based Visual Question Answering (TextVQA) aims to produce correct
answers for given questions about the images with multiple scene texts. In most
cases, the texts naturally attach to the surface of the objects. Therefore,
spatial reasoning between texts and objects is crucial in TextVQA. However,
existing approaches are constrained to 2D spatial information learned from
the input images and rely on transformer-based architectures to reason
implicitly during the fusion process. Under this setting, these 2D spatial
reasoning approaches cannot distinguish the …


