Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models
April 9, 2024, 4:50 a.m. | Songtao Jiang, Yan Zhang, Chenyi Zhou, Yeying Jin, Yang Feng, Jian Wu, Zuozhu Liu
cs.CL updates on arXiv.org
Abstract: Multimodal Large Language Models (MLLMs) such as GPT-4V and Gemini Pro face challenges in achieving human-level perception in Visual Question Answering (VQA), particularly in object-oriented perception tasks that demand fine-grained understanding of object identities, locations, or attributes, as indicated by empirical findings. This is mainly due to their limited capability to effectively integrate complex visual cues with textual information, and to potential object hallucinations. In this paper, we present a novel approach, Joint Visual and Text …
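The abstract is truncated before the method is described, but the general idea of joint visual and text prompting can be illustrated. The following is a minimal sketch, not the paper's method: it assumes the visual prompt is a marker (a numbered bounding box) drawn onto the image, paired with a text prompt that references that marker, and then sends both to a multimodal LLM via the OpenAI API. The file name, box coordinates, and prompt wording are hypothetical.

```python
# Illustrative sketch of joint visual + text prompting (NOT the paper's
# exact method): overlay a visual marker on the image, then ask a
# fine-grained, object-centric question that references the marker.
import base64
from io import BytesIO

from PIL import Image, ImageDraw
from openai import OpenAI


def add_visual_prompt(image_path: str, box: tuple[int, int, int, int]) -> str:
    """Draw a red box labeled '1' on the image; return it as base64 PNG."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, outline="red", width=4)
    draw.text((box[0] + 4, box[1] + 4), "1", fill="red")
    buf = BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()


def query_mllm(image_b64: str, question: str) -> str:
    """Send the marked image plus a marker-referencing text prompt to an MLLM."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                # The text prompt explicitly references the visual marker,
                # coupling the two modalities in a single query.
                {"type": "text",
                 "text": f"Region [1] is outlined in red. {question}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


# Hypothetical usage: ask about the object inside the marked region.
# b64 = add_visual_prompt("street.png", (120, 80, 360, 300))
# print(query_mllm(b64, "What is the object in region [1], and what are its attributes?"))
```

The intuition behind this style of prompting is that an explicit on-image marker gives the model an unambiguous visual anchor, while the matching text reference ties the question to that anchor, which can reduce the object-identity and localization errors the abstract describes.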