March 29, 2024, 4:46 a.m. | Tongjia Chen, Hongshan Yu, Zhengeng Yang, Zechuan Li, Wei Sun, Chen Chen

cs.CV updates on arXiv.org

arXiv:2312.00096v2 Announce Type: replace
Abstract: Due to the resource-intensive nature of training vision-language models on expansive video data, the majority of studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines propose to tackle the visual discrepancy with additional temporal learners, while overlooking the substantial discrepancy between web-scale descriptive narratives and concise action category names, leading to a less distinct semantic space and potential performance limitations. In this work, we prioritize the refinement of text knowledge to …
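
The truncated abstract does not describe the authors' own text-refinement method, but the "dominant pipeline" it critiques is well established: freeze a pre-trained image-language model (e.g. a CLIP-style encoder), add a temporal learner over per-frame features, and classify clips by similarity to embeddings of concise action category names. The sketch below is a minimal illustration of that baseline pipeline under assumed components; all module names, dimensions, and the transformer-based temporal learner are illustrative choices, not the paper's architecture.

```python
# Minimal sketch (assumed baseline, not the paper's method): a temporal
# learner on top of frozen per-frame image-language features, with
# zero-shot-style scoring against action category name embeddings.
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Temporal learner placed over frozen per-frame image features."""
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats):      # (batch, frames, dim)
        x = self.encoder(frame_feats)    # model cross-frame dependencies
        return x.mean(dim=1)             # pool to one video embedding

def classify(video_emb, text_emb, temperature=0.07):
    """Cosine similarity between video embeddings and the embeddings
    of concise category names (the text side the abstract critiques)."""
    v = video_emb / video_emb.norm(dim=-1, keepdim=True)
    t = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (v @ t.T) / temperature       # (batch, num_classes) logits

# Toy usage with random stand-ins for CLIP-style features.
frame_feats = torch.randn(4, 16, 512)    # 4 clips, 16 frames, 512-d feats
class_name_emb = torch.randn(400, 512)   # e.g. Kinetics-400 category names
adapter = TemporalAdapter()
logits = classify(adapter(frame_feats), class_name_emb)
print(logits.shape)                      # torch.Size([4, 400])
```

In this baseline, only the visual side is adapted; the text embeddings come straight from short category names, which is exactly the semantic-space limitation the abstract argues against.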
