Feb. 6, 2024, 5:44 a.m. | Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh

cs.LG updates on arXiv.org

Large Language Model (LLM) leaderboards based on benchmark rankings are regularly used to guide practitioners in model selection. Often, the published leaderboard rankings are taken at face value; we show this is a (potentially costly) mistake. Under existing leaderboards, the relative performance of LLMs is highly sensitive to (often minute) details. We show that for popular multiple-choice question benchmarks (e.g., MMLU), minor perturbations to the benchmark, such as changing the order of choices or the method of answer …
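As a minimal sketch of one perturbation the abstract mentions, the snippet below permutes the answer choices of an MMLU-style item while remapping the gold index. The item format and the `shuffle_choices` helper are hypothetical illustrations, not the paper's actual evaluation code.

```python
import random

def shuffle_choices(question: str, choices: list[str], answer_idx: int,
                    seed: int = 0) -> tuple[str, list[str], int]:
    """Return the same item with its choices permuted and the gold index remapped."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    new_choices = [choices[i] for i in order]
    # Track where the gold answer moved so the item stays correctly labeled.
    new_answer_idx = order.index(answer_idx)
    return question, new_choices, new_answer_idx

# The gold answer is the same string, but its letter label (A/B/C/D) changes;
# this is the kind of "minute detail" leaderboard rankings can be sensitive to.
q, c, a = shuffle_choices(
    "What is the capital of France?",
    ["Berlin", "Paris", "Madrid", "Rome"],
    answer_idx=1,
    seed=42,
)
print(c[a])  # -> "Paris", regardless of the permutation
```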
