May 12, 2023, 12:45 a.m. | Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal

cs.CV updates on arXiv.org arxiv.org

Recent studies have shown promising results on utilizing pre-trained
image-language models for video question answering. While these image-language
models can efficiently bootstrap the representation learning of video-language
models, they typically concatenate uniformly sampled video frames as visual
inputs without explicit language-aware, temporal modeling. When only a portion
of a video input is relevant to the language query, such uniform frame sampling
can often lead to missing important visual cues. Although humans often find a
video moment to focus on and …

arxiv bootstrap image language language model language models localization modeling question answering representation representation learning studies temporal video

AI Research Scientist

@ Vara | Berlin, Germany and Remote

Data Architect

@ University of Texas at Austin | Austin, TX

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

Lead GNSS Data Scientist

@ Lurra Systems | Melbourne

Senior Machine Learning Engineer (MLOps)

@ Promaton | Remote, Europe

Senior Machine Learning Engineer

@ Samsara | Canada - Remote