March 22, 2024, 4:45 a.m. | Dingchen Yang, Bowen Cao, Guang Chen, Changjun Jiang

cs.CV updates on arXiv.org

arXiv:2403.14401v1 Announce Type: new
Abstract: Multi-modal Large Language Models (MLLMs) demonstrate remarkable success across various vision-language tasks. However, they suffer from visual hallucination, where the generated responses diverge from the provided image. Are MLLMs completely oblivious to accurate visual cues when they hallucinate? Our investigation reveals that the visual branch may simultaneously advocate both accurate and non-existent content. To address this issue, we propose Pensieve, a training-free method inspired by our observation that analogous visual hallucinations can arise among images …
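The abstract stops short of spelling out the mechanism, but its framing (a training-free fix exploiting the fact that analogous hallucinations recur across similar images) suggests a contrastive-decoding-style correction: compare the model's next-token confidence on the test image against its confidence on retrieved analogous images, and suppress tokens both agree on. The Python sketch below illustrates that general idea only; it is not the paper's confirmed implementation, and the function name, the alpha knob, and the toy logits are all hypothetical.

import torch

def compare_with_references(
    test_logits: torch.Tensor,       # (vocab,) next-token logits for the test image
    reference_logits: torch.Tensor,  # (k, vocab) logits for k retrieved analogous images
    alpha: float = 0.5,              # hypothetical penalty strength
) -> torch.Tensor:
    """Down-weight tokens that the analogous images also advocate.

    If a hallucinated token is driven by spurious visual features shared
    across similar-looking images, the reference logits will score it
    highly too; subtracting their average suppresses it while preserving
    tokens specific to the test image.
    """
    avg_reference = reference_logits.mean(dim=0)
    adjusted = test_logits - alpha * avg_reference
    return torch.log_softmax(adjusted, dim=-1)

# Toy usage: a 5-token vocabulary and 3 retrieved reference images.
test = torch.randn(5)
refs = torch.randn(3, 5)
print(compare_with_references(test, refs))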

