Rethinking Generative Large Language Model Evaluation for Semantic Comprehension
March 13, 2024, 4:47 a.m. | Fangyun Wei, Xi Chen, Lin Luo
cs.CL updates on arXiv.org
Abstract: Despite their sophisticated capabilities, large language models (LLMs) encounter a major hurdle in effective assessment. This paper first revisits the prevalent evaluation method, multiple-choice question answering (MCQA), which allows for straightforward accuracy measurement. Through a comprehensive evaluation of 24 models across 11 benchmarks, we highlight several potential drawbacks of MCQA, for instance, the inconsistency between MCQA evaluation and the generation of open-ended responses in practical scenarios. In response, we introduce an RWQ-Elo rating system, …
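Elo-style systems rank players through pairwise matches, so an RWQ-Elo rating presumably scores models by comparing their responses head to head. As a rough illustration only, since the excerpt does not give the paper's exact K-factor, scaling, or tie handling, a standard Elo update for one judged comparison between two models looks like this (Python sketch; all names here are illustrative, not from the paper):

# Generic Elo update for pairwise model comparison. The paper's
# RWQ-Elo variant may differ; this is a standard-formula sketch.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after one match.

    score_a is 1.0 if model A's response is judged better, 0.0 if
    worse, and 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Example: both models start at 1500; model A's answer wins the round.
ra, rb = elo_update(1500.0, 1500.0, score_a=1.0)
print(ra, rb)  # 1516.0, 1484.0

Repeating this update over many question-level matchups converges toward a relative ranking of the models, which is what makes Elo attractive for open-ended response evaluation where per-question accuracy is not well defined.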