Feb. 6, 2024, 5:44 a.m. | Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh

cs.LG updates on arXiv.org

Large Language Model (LLM) leaderboards based on benchmark rankings are regularly used to guide practitioners in model selection. Often, the published leaderboard rankings are taken at face value; we show this is a (potentially costly) mistake. Under existing leaderboards, the relative performance of LLMs is highly sensitive to (often minute) details. We show that for popular multiple-choice question benchmarks (e.g., MMLU), minor perturbations to the benchmark, such as changing the order of choices or the method of answer …
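The choice-order perturbation the abstract mentions can be illustrated with a minimal sketch: generate every reordering of an item's answer options while remapping the gold-answer index, so a model can be scored on each variant and the spread in accuracy measured. The item shown is a toy example, not taken from MMLU.

```python
import itertools

def permute_choices(question, choices, answer_idx):
    """Yield every reordering of a multiple-choice item's options,
    remapping the gold-answer index so it still points at the same
    choice. Scoring a model on each variant and comparing accuracies
    probes its sensitivity to choice order."""
    for perm in itertools.permutations(range(len(choices))):
        permuted = [choices[i] for i in perm]
        # New position of the original correct choice in this ordering.
        new_answer = perm.index(answer_idx)
        yield question, permuted, new_answer

# Toy four-option item: 4! = 24 orderings, each with a consistent gold index.
variants = list(permute_choices(
    "What is 2 + 2?", ["3", "4", "5", "6"], answer_idx=1))
```

A robustness evaluation would then report, for each model, the range of scores across these orderings rather than a single leaderboard number.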

