Web: http://arxiv.org/abs/2209.06430

Sept. 15, 2022, 1:13 a.m. | Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo

cs.CV updates on arXiv.org

Pre-trained image-text models such as CLIP have demonstrated the strength of vision-language representations learned from large-scale, web-collected image-text data. Leveraging these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to utilize an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) remains underexplored. In this paper, we investigate two questions: 1) what are the factors hindering post-pretraining CLIP from further improving the performance …
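
As a concrete illustration of the image-to-video transfer the abstract refers to, below is a minimal sketch (not the authors' method) of a common baseline: mean-pooling per-frame CLIP embeddings into a single video feature and scoring it against a text query. It assumes OpenAI's `clip` package; the frames and the query string are placeholders.

```python
# Minimal sketch: transferring CLIP image features to video by mean-pooling
# per-frame embeddings (a common baseline; placeholders, not the paper's method).
# Assumes OpenAI's `clip` package: pip install git+https://github.com/openai/CLIP.git
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder for 8 sampled video frames; swap in real decoded frames.
frames = [Image.new("RGB", (224, 224)) for _ in range(8)]
frame_batch = torch.stack([preprocess(f) for f in frames]).to(device)

with torch.no_grad():
    frame_feats = model.encode_image(frame_batch)        # (T, D) per-frame features
    video_feat = frame_feats.mean(dim=0, keepdim=True)   # (1, D) mean-pooled video feature
    text_feat = model.encode_text(
        clip.tokenize(["a dog catching a frisbee"]).to(device)  # hypothetical query
    )

# Cosine similarity between the pooled video feature and the text query.
video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
print(f"video-text similarity: {(video_feat @ text_feat.T).item():.3f}")
```

Mean pooling ignores temporal order, which is part of why naive image-to-video transfer leaves room for the post-pretraining question the paper studies.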

Tags: alignment, arxiv, clip, image, language, representation, text, video
