March 13, 2024, 4:47 a.m. | Fangyun Wei, Xi Chen, Lin Luo

cs.CL updates on arXiv.org

arXiv:2403.07872v1 Announce Type: new
Abstract: Despite their sophisticated capabilities, large language models (LLMs) encounter a major hurdle in effective assessment. This paper first revisits the prevalent evaluation method-multiple choice question answering (MCQA), which allows for straightforward accuracy measurement. Through a comprehensive evaluation of 24 models across 11 benchmarks, we highlight several potential drawbacks of MCQA, for instance, the inconsistency between the MCQA evaluation and the generation of open-ended responses in practical scenarios. In response, we introduce an RWQ-Elo rating system, …
