April 12, 2024, 5:08 a.m. | /u/KennyMcKormick_

r/MachineLearning (www.reddit.com)

Paper: [https://arxiv.org/abs/2403.20330](https://arxiv.org/abs/2403.20330)

Evaluation Code: [https://github.com/open-compass/VLMEvalKit](https://github.com/open-compass/VLMEvalKit)

Abstract:

Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual …
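To make the "visual content is unnecessary" issue concrete, here is a minimal sketch of a text-only probe: feed only the question and options (no image) to an LLM and measure how often it still picks the correct answer. This is an illustration, not the paper's released pipeline; the `query_llm` helper and the `samples.jsonl` layout (fields `question`, `options`, `answer`) are hypothetical stand-ins you would replace with your own model client and benchmark export.

```python
import json

def query_llm(prompt: str) -> str:
    """Hypothetical text-only LLM call; replace the placeholder with a real client.

    Expected to return a single option letter such as 'A', 'B', 'C', or 'D'.
    """
    return "A"  # placeholder so the sketch runs end to end

def text_only_accuracy(path: str) -> float:
    """Score a multiple-choice benchmark with the images withheld."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            sample = json.loads(line)  # assumed fields: question, options (list), answer (letter)
            options = "\n".join(
                f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(sample["options"])
            )
            prompt = (
                "Answer the following multiple-choice question. "
                "Reply with a single letter.\n\n"
                f"{sample['question']}\n{options}"
            )
            prediction = query_llm(prompt).strip().upper()[:1]
            correct += prediction == sample["answer"]
            total += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    # A text-only score well above random chance suggests the answers leak
    # through the questions/options or the LLM's world knowledge, which is
    # exactly the failure mode the abstract describes.
    print(f"text-only accuracy: {text_only_accuracy('samples.jsonl'):.1%}")
```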
