Web: http://arxiv.org/abs/2206.08155

June 17, 2022, 1:12 a.m. | Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

cs.CL updates on arXiv.org

Video question answering (VideoQA) is a complex task that requires diverse
multi-modal data for training. Manual annotation of questions and answers for
videos, however, is tedious and prohibits scaling. To tackle this problem,
recent methods consider zero-shot settings with no manual annotation of visual
question-answer pairs. In particular, a promising approach adapts frozen
autoregressive language models pretrained on Web-scale text-only data to
multi-modal inputs. In contrast, we here build on frozen bidirectional language
models (BiLM) and show that such an approach …
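The core idea behind adapting a frozen language model can be sketched in toy form: keep the language model's weights fixed and train only a light projection that maps video features into the model's input space, then read the answer off a masked text position (cloze-style). Everything below is an illustrative sketch, not the paper's implementation; all dimensions, names, and the simplified "LM" are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the paper).
d_visual, d_model, seq_len, vocab = 16, 32, 6, 100

# "Frozen" language-model stand-ins: fixed weights, never updated.
token_emb = rng.normal(size=(vocab, d_model))      # frozen embedding table
lm_head = rng.normal(size=(d_model, vocab))        # frozen output head

# The only trainable part in this sketch: a linear projection mapping
# visual features into the LM's token-embedding space.
proj = rng.normal(size=(d_visual, d_model)) * 0.1  # trainable

def forward(video_feats, token_ids, mask_pos):
    """Prefix projected video features to the text embeddings and score
    the vocabulary at the masked position (cloze-style answer prediction)."""
    visual_tokens = video_feats @ proj             # (n_frames, d_model)
    text_tokens = token_emb[token_ids]             # (seq_len, d_model)
    hidden = np.concatenate([visual_tokens, text_tokens], axis=0)
    # A real bidirectional LM would run transformer layers over `hidden`
    # here; we go straight to the head to keep the sketch minimal.
    logits = hidden @ lm_head                      # (n_tokens, vocab)
    return logits[len(video_feats) + mask_pos]     # scores at the [MASK] slot

video = rng.normal(size=(4, d_visual))             # 4 frame features
question = rng.integers(0, vocab, size=seq_len)    # token ids; last one plays [MASK]
scores = forward(video, question, mask_pos=seq_len - 1)
print(scores.shape)  # one score per candidate answer token
```

During adaptation, only `proj` (and, in the paper's setting, a few light adapter-style parameters) would receive gradients; the frozen LM supplies the language knowledge it acquired from text-only pretraining.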

