Web: http://arxiv.org/abs/2112.01529

Jan. 24, 2022, 2:11 a.m. | Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang, Luowei Zhou, Lu Yuan

cs.LG updates on arXiv.org

This paper studies the BERT pretraining of video transformers. This is a
straightforward yet worthwhile extension, given the recent success of BERT
pretraining of image transformers. We introduce BEVT which decouples video
representation learning into spatial representation learning and temporal
dynamics learning. In particular, BEVT first performs masked image modeling on
image data, and then conducts masked image modeling jointly with masked video
modeling on video data. This design is motivated by two observations: 1)
transformers learned on image datasets …
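The core idea behind masked image/video modeling is to hide a fraction of the input patches and train the transformer to predict the hidden content. Below is a minimal, hypothetical sketch of the masking step only; BEVT itself follows a blockwise masking strategy and predicts discrete visual tokens, which this toy example does not reproduce. The function name, mask ratio, and patch counts are illustrative assumptions, not values from the paper.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio=0.4, seed=0):
    """Randomly select patch indices to mask.

    Hypothetical sketch: BEVT uses blockwise masking rather than
    this uniform random choice; only the general idea is shown.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(num_patches * mask_ratio)
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

# A 224x224 image with 16x16 patches yields 14*14 = 196 patches;
# a video clip adds a temporal axis on top of this spatial grid.
image_mask = random_patch_mask(196)
print(image_mask.sum())  # number of masked patches (40% of 196 = 78)
```

In masked image modeling the masked positions are replaced by a learnable mask token, and the model is trained to reconstruct the original patch content (or its discrete token) at those positions; masked video modeling applies the same recipe to spatiotemporal patch grids.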

arxiv bert cv transformers video
