Aug. 23, 2022, 1:16 a.m. | Bang Yang, Tong Zhang, Yuexian Zou

cs.CV updates on arXiv.org

For video captioning, "pre-training and fine-tuning" has become a de facto
paradigm: ImageNet Pre-training (INP) is usually used to encode the video
content, and a task-oriented network is then trained from scratch to handle
caption generation. This paper first investigates the impact of the recently
proposed CLIP (Contrastive Language-Image Pre-training) on video captioning.
Through an empirical study of INP vs. CLIP, we identify the potential
deficiencies of INP and explore the key factors for accurate description
generation. The results …

