March 1, 2024, 5:47 a.m. | Hao Cheng, Erjia Xiao, Renjing Xu

cs.CV updates on arXiv.org

arXiv:2402.19150v1 Announce Type: new
Abstract: Large Multimodal Models (LMMs) rely on pre-trained Vision Language Models (VLMs) and Large Language Models (LLMs) to achieve remarkable emergent abilities on various multimodal tasks in the joint space of vision and language. However, the Typographic Attack, which has been shown to disrupt VLMs, has also been identified as a security vulnerability for LMMs. In this work, we first comprehensively investigate the distractibility of LMMs by typography. In particular, we introduce the Typographic Dataset designed to evaluate …
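
To make the threat model concrete, below is a minimal sketch of a typographic attack against a CLIP-style VLM, assuming the open-source `open_clip` and `PIL` libraries; the checkpoint name, image path, and overlay position are illustrative assumptions, and the paper's own Typographic Dataset and LMM evaluation pipeline are not reproduced here.

```python
# Sketch: overlay a misleading word on an image and compare the VLM's
# zero-shot predictions before and after the typographic perturbation.
import torch
from PIL import Image, ImageDraw
import open_clip

# Load a pre-trained CLIP model (checkpoint choice is an assumption).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def add_typography(image: Image.Image, text: str) -> Image.Image:
    """Draw a misleading word onto the image (the typographic perturbation)."""
    attacked = image.copy()
    draw = ImageDraw.Draw(attacked)
    draw.text((10, 10), text, fill="white")  # default font; position is arbitrary
    return attacked

labels = ["a photo of a dog", "a photo of a cat"]
image = Image.open("dog.jpg").convert("RGB")   # hypothetical input image of a dog
attacked = add_typography(image, "cat")        # overlay the competing class name

with torch.no_grad():
    text_feat = model.encode_text(tokenizer(labels))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    for name, img in [("clean", image), ("typographic", attacked)]:
        img_feat = model.encode_image(preprocess(img).unsqueeze(0))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1)
        print(name, probs.tolist())
```

In practice, the overlaid word often shifts probability mass toward the written class rather than the visual content, which is the distractibility the paper's dataset is designed to measure systematically for LMMs.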

