Feb. 29, 2024, 5:45 a.m. | Koki Maeda, Shuhei Kurita, Taiki Miyanishi, Naoaki Okazaki

cs.CV updates on arXiv.org

arXiv:2402.17969v1 Announce Type: new
Abstract: Given the accelerating progress of vision and language modeling, accurate evaluation of machine-generated image captions remains critical. In order to evaluate captions more closely to human preferences, metrics need to discriminate between captions of varying quality and content. However, conventional metrics fail short of comparing beyond superficial matches of words or embedding similarities; thus, they still need improvement. This paper presents VisCE$^2$, a vision language model-based caption evaluation method. Our method focuses on visual context, …
