March 29, 2024, 4:46 a.m. | Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, Yang Liu

cs.CV updates on arXiv.org

arXiv:2311.15596v2 Announce Type: replace
Abstract: Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective, and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark …

