June 19, 2024, 4:42 a.m. | Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica

cs.CL updates on arXiv.org

arXiv:2406.11939v1 Announce Type: cross
Abstract: The rapid evolution of language models has necessitated the development of more challenging benchmarks. Current static benchmarks often struggle to consistently distinguish between the capabilities of different models and fail to align with real-world user preferences. On the other hand, live crowd-sourced platforms like the Chatbot Arena collect a wide range of natural prompts and user feedback. However, these prompts vary in sophistication and the feedback cannot be applied offline to new models. In order …

