all AI news
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models. (arXiv:2206.08155v1 [cs.CV])
Web: http://arxiv.org/abs/2206.08155
cs.CL updates on arXiv.org arxiv.org
Video question answering (VideoQA) is a complex task that requires diverse
multi-modal data for training. Manual annotation of question and answers for
videos, however, is tedious and prohibits scalability. To tackle this problem,
recent methods consider zero-shot settings with no manual annotation of visual
question-answer. In particular, a promising approach adapts frozen
autoregressive language models pretrained on Web-scale text-only data to
multi-modal inputs. In contrast, we here build on frozen bidirectional language
models (BiLM) and show that such an approach …
arxiv cv language language models models question answering video