all AI news
Anchor Points: Benchmarking Models with Much Fewer Examples
Feb. 20, 2024, 5:52 a.m. | Rajan Vivek, Kawin Ethayarajh, Diyi Yang, Douwe Kiela
cs.CL updates on arXiv.org arxiv.org
Abstract: Modern language models often exhibit powerful but brittle behavior, leading to the development of larger and more diverse benchmarks to reliably assess their behavior. Here, we suggest that model performance can be benchmarked and elucidated with much smaller evaluation sets. We first show that in six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models. We build upon this phenomenon to propose Anchor Point …
abstract anchor arxiv behavior benchmarking benchmarks cs.cl development diverse evaluation examples language language models modern performance popular show six type
More from arxiv.org / cs.CL updates on arXiv.org
Jobs in AI, ML, Big Data
Software Engineer for AI Training Data (School Specific)
@ G2i Inc | Remote
Software Engineer for AI Training Data (Python)
@ G2i Inc | Remote
Software Engineer for AI Training Data (Tier 2)
@ G2i Inc | Remote
Data Engineer
@ Lemon.io | Remote: Europe, LATAM, Canada, UK, Asia, Oceania
Artificial Intelligence – Bioinformatic Expert
@ University of Texas Medical Branch | Galveston, TX
Lead Developer (AI)
@ Cere Network | San Francisco, US