From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
June 19, 2024, 4:42 a.m. | Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica
cs.CL updates on arXiv.org
Abstract: The rapid evolution of language models has necessitated the development of more challenging benchmarks. Current static benchmarks often struggle to consistently distinguish between the capabilities of different models and fail to align with real-world user preferences. Live crowd-sourced platforms like the Chatbot Arena, by contrast, collect a wide range of natural prompts and user feedback. However, these prompts vary in sophistication, and the feedback cannot be applied offline to new models. In order …