Sept. 22, 2022, 1:14 a.m. | Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, Jie Chen

cs.CV updates on arXiv.org

Text-based Visual Question Answering (TextVQA) aims to produce correct
answers to questions about images that contain multiple scene texts. In most
cases, the texts are naturally attached to the surfaces of objects. Therefore,
spatial reasoning between texts and objects is crucial in TextVQA. However,
existing approaches are constrained to 2D spatial information learned from
the input images and rely on transformer-based architectures to reason
implicitly during the fusion process. Under this setting, these 2D spatial
reasoning approaches cannot distinguish the …

