Evaluating Large Language Models at Evaluating Instruction Following
April 17, 2024, 4:43 a.m. | Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, Danqi Chen
cs.LG updates on arXiv.org
Abstract: As research in large language models (LLMs) continues to accelerate, LLM-based evaluation has emerged as a scalable and cost-effective alternative to human evaluation for comparing the ever-increasing list of models. This paper investigates the efficacy of these "LLM evaluators", particularly in using them to assess instruction following, a metric that gauges how closely generated text adheres to the given instruction. We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an …
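The setup the abstract describes can be pictured as follows: an LLM evaluator is prompted with an instruction and two candidate outputs and asked which one follows the instruction better, and a meta-evaluation benchmark like LLMBar then scores the evaluator against gold preference labels. Below is a minimal sketch of that loop; the `complete()` helper, the judge prompt, and the example schema are all hypothetical stand-ins, not the actual LLMBar protocol or prompts from the paper.

```python
# Sketch of pairwise LLM-based evaluation of instruction following.
# `complete` is a hypothetical placeholder for any LLM completion API;
# the prompt below is illustrative, not the one used by LLMBar.

def complete(prompt: str) -> str:
    """Placeholder: call your LLM of choice and return its text response."""
    raise NotImplementedError

JUDGE_TEMPLATE = """\
You are judging which of two responses better follows the instruction.
Judge only adherence to the instruction, not style or verbosity.

Instruction:
{instruction}

Response A:
{output_a}

Response B:
{output_b}

Answer with a single letter, "A" or "B"."""


def judge(instruction: str, output_a: str, output_b: str) -> str:
    """Ask the LLM evaluator which output follows the instruction better."""
    verdict = complete(JUDGE_TEMPLATE.format(
        instruction=instruction, output_a=output_a, output_b=output_b))
    return "A" if verdict.strip().upper().startswith("A") else "B"


def meta_evaluate(examples: list[dict]) -> float:
    """Accuracy of the LLM evaluator against gold preference labels.

    Each example is assumed to look like
    {"instruction": ..., "output_a": ..., "output_b": ..., "label": "A" or "B"};
    this schema is illustrative, not the LLMBar release format.
    """
    correct = sum(
        judge(e["instruction"], e["output_a"], e["output_b"]) == e["label"]
        for e in examples
    )
    return correct / len(examples)
```

A meta-evaluation benchmark of this kind measures the evaluator, not the models being compared: a high `meta_evaluate` score means the LLM judge's preferences agree with the gold labels for instruction following.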