CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment. (arXiv:2209.06430v2 [cs.CV] UPDATED)
cs.CV updates on arXiv.org
Pre-trained image-text models such as CLIP have demonstrated the strong
power of vision-language representations learned from large-scale
web-collected image-text data. In light of these well-learned visual
features, some existing works transfer the image representation to the
video domain and achieve good results. However, how to utilize an
image-language pre-trained model (e.g., CLIP) for video-language
pre-training (post-pretraining) is still underexplored. In this paper, we
investigate two questions: 1) what are the factors hindering
post-pretrained CLIP from further improving the performance …
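The abstract alludes to existing works that transfer CLIP's image representation to video. As an illustration only, here is a minimal sketch of that naive baseline: encode sampled frames with a frozen CLIP image encoder, mean-pool them into a video embedding, and score it against a text embedding. This is not the CLIP-ViP method itself; the HuggingFace transformers API, model checkpoint, frame count, and mean pooling are all illustrative assumptions.

```python
# Minimal frame-pooling baseline for video-text alignment with CLIP.
# Assumptions: HuggingFace transformers, the openai/clip-vit-base-patch32
# checkpoint, 8 sampled frames, and mean pooling -- not the CLIP-ViP method.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Dummy stand-in for T=8 sampled RGB frames of a video (H, W, C each).
frames = [torch.randint(0, 256, (224, 224, 3), dtype=torch.uint8).numpy()
          for _ in range(8)]
text = ["a person riding a bicycle"]

with torch.no_grad():
    image_inputs = processor(images=frames, return_tensors="pt")
    frame_emb = model.get_image_features(**image_inputs)   # (T, D)
    video_emb = frame_emb.mean(dim=0, keepdim=True)        # (1, D) mean pool

    text_inputs = processor(text=text, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)      # (1, D)

    # Cosine similarity as the video-text alignment score.
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    score = (video_emb @ text_emb.T).item()

print(f"video-text similarity: {score:.3f}")
```

Mean pooling discards temporal order entirely, which is one reason such direct transfer leaves room for improvement and motivates video-specific post-pretraining of the kind the paper studies.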
Tags: alignment, arxiv, clip, image, language, representation, text, video