Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
April 30, 2024, 4:50 a.m. | Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis
cs.CL updates on arXiv.org arxiv.org
Abstract: As Large Language Models (LLMs) have become more advanced, they have outpaced our ability to accurately evaluate their quality. Not only is it difficult to find data that adequately probes particular model properties, but evaluating the correctness of a model's free-form generation alone is a challenge. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model like …
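
The abstract's core idea is to replace a single large judge model with a panel of judges whose verdicts are pooled. A minimal sketch of that pattern in Python, assuming a hypothetical query_judge API client and made-up judge model names (the paper's actual panel composition and pooling rule are not given in the excerpt above):

def query_judge(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call; swap in your own client.
    # This toy stub always answers "1" so the sketch runs end to end.
    return "1"

# Hypothetical judge model identifiers for the panel.
JUDGE_MODELS = ["judge-model-a", "judge-model-b", "judge-model-c"]

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Is the candidate answer correct? Reply with 1 for yes or 0 for no."
)

def panel_score(question: str, reference: str, candidate: str) -> float:
    # Ask every judge on the panel for a binary verdict, then pool the
    # votes by averaging (one simple pooling choice among several).
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    votes = [
        1 if query_judge(model, prompt).strip().startswith("1") else 0
        for model in JUDGE_MODELS
    ]
    return sum(votes) / len(votes)  # e.g. 2 of 3 "correct" votes -> 0.67

if __name__ == "__main__":
    print(panel_score("What is 2 + 2?", "4", "4"))  # 1.0 with the toy stub

Averaging independent verdicts from several smaller, heterogeneous judges is one way to reduce the intra-model bias a single large judge can introduce, which is the motivation the abstract gestures at.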