March 22, 2024, 4:45 a.m. | Dingchen Yang, Bowen Cao, Guang Chen, Changjun Jiang

cs.CV updates on arXiv.org

arXiv:2403.14401v1 Announce Type: new
Abstract: Multi-modal Large Language Models (MLLMs) demonstrate remarkable success across various vision-language tasks. However, they suffer from visual hallucination, where the generated responses diverge from the provided image. Are MLLMs completely oblivious to accurate visual cues when they hallucinate? Our investigation reveals that the visual branch may simultaneously advocate both accurate and non-existent content. To address this issue, we propose Pensieve, a training-free method inspired by our observation that analogous visual hallucinations can arise among images …
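The abstract describes the mechanism only at a high level (training-free, exploiting the fact that analogous hallucinations recur across images), so the snippet below is a minimal, hypothetical sketch of one way such a reference-based correction could look: contrast next-token logits from the test image against logits from visually similar reference images, so that content the references also "advocate" (a likely shared hallucination) is down-weighted. The `model(...)` call signature, `reference_images`, and the weight `alpha` are illustrative assumptions, not Pensieve's actual API or formula.

```python
import torch

def contrastive_next_token_logits(model, prompt_ids, test_image,
                                  reference_images, alpha=1.0):
    """Sketch of reference-based contrastive decoding (hypothetical interface).

    Assumes an HF-style model that returns an object with a .logits
    tensor of shape (batch, seq_len, vocab_size).
    """
    # Next-token logits conditioned on the actual test image.
    test_logits = model(prompt_ids, image=test_image).logits[:, -1, :]

    # Average next-token logits over visually similar reference images.
    ref_logits = torch.stack([
        model(prompt_ids, image=img).logits[:, -1, :]
        for img in reference_images
    ]).mean(dim=0)

    # Amplify what is specific to the test image and down-weight content
    # the references also advocate, i.e. a likely shared hallucination.
    return (1 + alpha) * test_logits - alpha * ref_logits
```

Sampling from the returned logits instead of `test_logits` alone would, under these assumptions, preserve image-specific content while suppressing tokens that similar images would have triggered anyway.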
