CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter. (arXiv:2111.15162v2 [cs.CV] UPDATED)
cs.CV updates on arXiv.org
For video captioning, "pre-training and fine-tuning" has become a de facto
paradigm, where ImageNet Pre-training (INP) is usually used to encode the video
content, and a task-oriented network is then fine-tuned from scratch to handle
caption generation. This paper first investigates the impact of the recently
proposed CLIP (Contrastive Language-Image Pre-training) on video captioning.
Through the empirical study on INP vs. CLIP, we identify the potential
deficiencies of INP and explore the key factors for accurate description
generation. The results …
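The paradigm the abstract describes, encoding video frames with a pre-trained image backbone and then fine-tuning a task-oriented captioning network on top, can be sketched in a few lines. The sketch below contrasts the two feature-extraction routes being compared (INP vs. CLIP); the checkpoints, dummy frames, and feature shapes are illustrative assumptions, not the paper's actual setup.

import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights
from transformers import CLIPModel, CLIPImageProcessor

# Stand-in "video": a handful of blank frames; a real pipeline would sample
# frames from a clip before encoding them.
frames = [Image.new("RGB", (224, 224)) for _ in range(8)]

# --- INP route: ImageNet-pretrained ResNet-50, pooled visual features ---
inp_weights = ResNet50_Weights.IMAGENET1K_V2
inp_encoder = resnet50(weights=inp_weights)
inp_encoder.fc = torch.nn.Identity()  # drop the ImageNet classification head
inp_encoder.eval()
preprocess = inp_weights.transforms()
with torch.no_grad():
    inp_feats = torch.stack(
        [inp_encoder(preprocess(f).unsqueeze(0)).squeeze(0) for f in frames]
    )  # shape: (num_frames, 2048)

# --- CLIP route: image encoder trained with language supervision ---
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    pixels = clip_proc(images=frames, return_tensors="pt").pixel_values
    clip_feats = clip.get_image_features(pixel_values=pixels)  # (num_frames, 512)

# Either frame-feature sequence would then be fed to a task-oriented caption
# decoder that is fine-tuned on the captioning data.
print(inp_feats.shape, clip_feats.shape)

Either sequence of per-frame features serves as the "video content" representation; the paper's empirical study concerns how the choice of pre-training (INP vs. CLIP) affects the quality of the generated descriptions.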