Feb. 27, 2024, 5:47 a.m. | Yi Zong, Xipeng Qiu

cs.CV updates on arXiv.org

arXiv:2402.15745v1 Announce Type: cross
Abstract: Large Vision-Language Models (LVLMs) have demonstrated strong abilities in image perception and language understanding. However, existing multimodal benchmarks focus on primary perception abilities and commonsense knowledge, which are insufficient to reflect the comprehensive capabilities of LVLMs. We propose GAOKAO-MM, a multimodal benchmark based on the Chinese College Entrance Examination (GAOKAO), comprising 8 subjects and 12 types of images, such as diagrams, function graphs, maps, and photos. GAOKAO-MM derives from a native Chinese context and …
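The announcement does not include code, but a benchmark like the one described plausibly pairs each image with a multiple-choice exam question. The sketch below is a minimal, hypothetical evaluation loop under that assumption; the record fields, the `predict` helper, and the model callable are illustrative stand-ins, not the benchmark's actual schema or API.

```python
from dataclasses import dataclass

@dataclass
class MMQuestion:
    # Hypothetical record layout for a GAOKAO-MM-style item (not the official schema).
    subject: str              # e.g. "mathematics", "geography"
    image_type: str           # e.g. "diagram", "function graph", "map", "photo"
    image_path: str           # path to the associated image
    question: str             # question text (Chinese in the original benchmark)
    options: dict[str, str]   # option letter -> option text
    answer: str               # gold option letter, e.g. "B"

def predict(model, item: MMQuestion) -> str:
    """Format the question as a prompt; a real LVLM would consume the image too."""
    prompt = item.question + "\n" + "\n".join(
        f"{k}. {v}" for k, v in item.options.items()
    )
    # `model` is any callable taking (image_path, prompt) and returning an option letter.
    return model(item.image_path, prompt)

def accuracy(model, items: list[MMQuestion]) -> float:
    """Fraction of items where the predicted option letter matches the gold answer."""
    correct = sum(predict(model, it) == it.answer for it in items)
    return correct / len(items)
```

Per-subject or per-image-type breakdowns, which the 8-subject / 12-image-type structure suggests, would follow by grouping items on `subject` or `image_type` before calling `accuracy`.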
