May 5, 2022, 1:10 a.m. | Revanth Gangi Reddy, Xilin Rui, Manling Li, Xudong Lin, Haoyang Wen, Jaemin Cho, Lifu Huang, Mohit Bansal, Avirup Sil, Shih-Fu Chang, Alexander Schwin

cs.CV updates on arXiv.org

Recently, there has been increasing interest in building question
answering (QA) models that reason across multiple modalities, such as text and
images. However, QA using images is often limited to picking the answer
from a predefined set of options. In addition, images in the real world,
especially in news, have objects that are co-referential to the text, with
complementary information from both modalities. In this paper, we present a new
QA evaluation benchmark with 1,384 questions over news …

