Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

Feb. 20, 2024, 5:52 a.m. | Timothy R. McIntosh, Teo Susnjak, Tong Liu, Paul Watters, Malka N. Halgamuge

cs.CL updates on arXiv.org

arXiv:2402.09880v1 Announce Type: cross
Abstract: The rapid rise in popularity of Large Language Models (LLMs) with emerging capabilities has spurred public curiosity to evaluate and compare different LLMs, leading many researchers to propose their LLM benchmarks. Noticing preliminary inadequacies in those benchmarks, we embarked on a study to critically assess 23 state-of-the-art LLM benchmarks, using our novel unified evaluation framework through the lenses of people, process, and technology, under the pillars of functionality and security. Our research uncovered significant limitations, …
