April 16, 2024, 4:45 a.m. | Yutaro Yamada, Yingtian Tang, Yoyo Zhang, Ilker Yildirim

cs.LG updates on arXiv.org

arXiv:2212.12043v2 Announce Type: replace-cross
Abstract: Large-scale vision-language models such as CLIP have shown impressive performance on zero-shot image classification and image-to-text retrieval. However, this performance does not carry over to tasks that require a finer-grained correspondence between vision and language, such as Visual Question Answering (VQA). As a potential cause of the difficulty of applying these models to VQA and similar tasks, we report an interesting phenomenon of vision-language models, which we call the Concept Association Bias (CAB). We find that …
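For context, the zero-shot image classification the abstract refers to works by scoring an image against a set of candidate text prompts in CLIP's shared embedding space and taking the highest-scoring prompt as the label. Below is a minimal sketch, assuming the Hugging Face transformers CLIP implementation (openai/clip-vit-base-patch32); the image URL and label prompts are placeholders, and this is illustrative, not the paper's own code:

    import requests
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Load a pretrained CLIP checkpoint (assumed: openai/clip-vit-base-patch32)
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Placeholder example image (a commonly used COCO validation image)
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # Candidate labels phrased as prompts; zero-shot classification picks the best match
    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # Image-text similarity scores, turned into a distribution over the candidate labels
    probs = outputs.logits_per_image.softmax(dim=-1)
    print(dict(zip(labels, probs[0].tolist())))

The paper's point is that this prompt-matching style of inference, while strong for classification and retrieval, does not by itself provide the finer-grained vision-language correspondence that tasks like VQA require.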
