From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
June 19, 2024, 4:42 a.m. | Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica
cs.CL updates on arXiv.org
Abstract: The rapid evolution of language models has necessitated the development of more challenging benchmarks. Current static benchmarks often struggle to consistently distinguish between the capabilities of different models and fail to align with real-world user preferences. Live crowd-sourced platforms like the Chatbot Arena, by contrast, collect a wide range of natural prompts and user feedback. However, these prompts vary in sophistication, and the feedback cannot be applied offline to new models. In order …