May 14, 2024, 4:47 a.m. | Guangzhao Dai, Xiangbo Shu, Wenhao Wu, Rui Yan, Jiachao Zhang

cs.CV updates on arXiv.org arxiv.org

arXiv:2401.10039v2 Announce Type: replace
Abstract: Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown impressive performance in various visual recognition tasks. This advancement paves the way for notable performance in Zero-Shot Egocentric Action Recognition (ZS-EAR). Typically, VLMs handle ZS-EAR as a global video-text matching task, which often leads to suboptimal alignment of vision and linguistic knowledge. We propose a refined approach for ZS-EAR using VLMs, emphasizing fine-grained concept-description alignment that capitalizes on the rich semantic and contextual details in egocentric …
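For context, the "global video-text matching" baseline that the abstract contrasts against is typically a CLIP-style zero-shot classifier: one pooled video embedding is compared against a text embedding per candidate action. The sketch below illustrates that idea only; the encoders, prompts, and temperature are illustrative assumptions, not the paper's proposed fine-grained method.

```python
# Minimal sketch of zero-shot action recognition as global video-text matching.
# All embeddings here are random placeholders; in practice they would come
# from a pretrained VLM (e.g. a CLIP-style video/text encoder pair).
import torch
import torch.nn.functional as F

def zero_shot_action_scores(video_embedding: torch.Tensor,
                            text_embeddings: torch.Tensor,
                            temperature: float = 0.01) -> torch.Tensor:
    """Score each candidate action by cosine similarity between a single
    global video embedding (D,) and per-class text embeddings (C, D)."""
    v = F.normalize(video_embedding, dim=-1)   # (D,)
    t = F.normalize(text_embeddings, dim=-1)   # (C, D)
    return (t @ v) / temperature               # (C,) logits over classes

# Hypothetical usage with placeholder features.
D, C = 512, 4
prompts = ["a person cutting vegetables",
           "a person washing dishes",
           "a person opening a drawer",
           "a person pouring water"]
video_emb = torch.randn(D)        # pooled (global) video feature
text_embs = torch.randn(C, D)     # one text embedding per action prompt
probs = zero_shot_action_scores(video_emb, text_embs).softmax(dim=-1)
predicted_action = prompts[int(probs.argmax())]
```

Because the whole clip is collapsed into one vector, fine-grained cues (hands, objects, interactions) are averaged away; the paper's fine-grained concept-description alignment is motivated by exactly this limitation.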
