all AI news
OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition
March 29, 2024, 4:46 a.m. | Tongjia Chen, Hongshan Yu, Zhengeng Yang, Zechuan Li, Wei Sun, Chen Chen
cs.CV updates on arXiv.org arxiv.org
Abstract: Due to the resource-intensive nature of training vision-language models on expansive video data, a majority of studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines propose to tackle the visual discrepancies with additional temporal learners while overlooking the substantial discrepancy for web-scaled descriptive narratives and concise action category names, leading to less distinct semantic space and potential performance limitations. In this work, we prioritize the refinement of text knowledge to …
abstract arxiv cs.cv data domain general image knowledge language language models nature pipelines recognition studies temporal text training type video video data vision vision-language models visual
More from arxiv.org / cs.CV updates on arXiv.org
Jobs in AI, ML, Big Data
Data Architect
@ University of Texas at Austin | Austin, TX
Data ETL Engineer
@ University of Texas at Austin | Austin, TX
Lead GNSS Data Scientist
@ Lurra Systems | Melbourne
Senior Machine Learning Engineer (MLOps)
@ Promaton | Remote, Europe
Senior Principal, Product Strategy Operations, Cloud Data Analytics
@ Google | Sunnyvale, CA, USA; Austin, TX, USA
Data Scientist - HR BU
@ ServiceNow | Hyderabad, India